
I have the following HTML structure and want to extract data from it using awk.

<body>
<div>...</div>
<div>...</div>
<div class="body-content">
    <div>...</div>
    <div class="product-list" class="container">
        <div class="w3-row" id="product-list-row">
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product A</div>
                    <div class="product-price">100,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product B</div>
                    <div class="product-price">200,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product C</div>
                    <div class="product-price">300,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product D</div>
                    <div class="product-price">400,56</div>
                </div>
            </div>
        </div>
    </div>
</div>
</body>

The result I want to have is as follows.

100,56
200,56
300,56
400,56

I was experimenting with the following awk script (I know it makes no sense to match product-price twice; I was about to modify the script):

awk -F '<[^>]+>' 'found { sub(/^[[:space:]]*/,";"); print title $0; found=0 } /<div class="product-price">/ { title=$2 } /<div class="product-price">/  { found=1 }'

but it gives me the result

100,56                </div>
200,56                </div>
300,56                </div>
400,56                </div>

I have never used awk before, so I can't figure out what is wrong here or how to modify the above code. How would you do this?

5 Comments

  • Can you use a tool that understands XML instead, e.g. xmlstarlet? Commented Jun 27, 2021 at 17:34
  • Awk is a great tool for many sorts of text searching, but it is not well-suited for hierarchical structures like HTML. You'd be much better off with a tool designed for the job. @Ed Morton's suggestion xmlstarlet is a fine choice for use from the shell. Alternatively, if you know any scripting languages (e.g. Perl, Python, Ruby, JavaScript, ...) most of them have installable libraries for HTML parsing. Commented Jun 27, 2021 at 17:55
  • Actually, GNU awk has an XML library too - see gawkextlib.sourceforge.net/xml/xml.html. Commented Jun 27, 2021 at 17:57
  • See also: stackoverflow.com/a/1732454/7552 Commented Jun 27, 2021 at 20:37
  • @EdMorton true, though last I checked installing gawk add-ons was not as straightforward as using cpanm, pip, gem, npm, etc. Commented Jun 28, 2021 at 4:37

5 Answers


With your shown samples and attempts, please try the following awk code.

awk -F"[><]" '{gsub(/\r/,"")} /^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{print $3}' Input_file

Explanation: a commented version of the above. It is for explanation purposes only; to run the code, please use the one-liner above.

awk -F"[><]" '      ##Start the awk program and set the field separator to > or <.
{gsub(/\r/,"")}     ##Remove control-M (carriage return) characters from each line.
/^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{ ##Match lines that start with blanks
                    ##followed by <div class="product-price"> up to the closing div tag.
  print $3          ##Print the 3rd field, which is the price.
}
' Input_file        ##Name of the input file.

Changed the regex to /^[ \t]+<div[ \t]+class as per Ed's suggestion in the comments. Also, experts always recommend using xmlstarlet or another XML-aware tool if one is available on your system.
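To make the field numbering concrete, here is a minimal Python sketch of the same idea, assuming the price and its closing tag sit on one line as in the question's sample. The `lines` list and the variable names are hypothetical; only the regex and the separator come from the awk command above.

```python
import re

# Hypothetical sample lines copied from the question's HTML,
# with a trailing carriage return as reported in the comments.
lines = [
    '                    <div class="product-title">Product A</div>\r',
    '                    <div class="product-price">100,56</div>\r',
    '                </div>\r',
]

# Same condition as the awk pattern: leading blanks, the opening tag,
# then anything up to a closing div tag.
price_line = re.compile(r'^[ \t]+<div[ \t]+class="product-price">.*</div>')

prices = []
for line in lines:
    line = line.replace("\r", "")          # like gsub(/\r/,"")
    if price_line.search(line):
        fields = re.split(r"[><]", line)   # like -F"[><]"
        prices.append(fields[2])           # awk's $3 is index 2 in Python

print(prices)
```

Splitting on `>` or `<` yields the leading blanks as the first field, the tag text as the second, and the price as the third, which is why the awk code prints `$3`.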


13 Comments

@RavinderSingh13, nice catch! The file contains control M characters.
control-Ms in the input would not cause Ravinder's original script to produce no output; it would work just fine either way, since it's not doing anything with the character at the end of each line.
Reading the tea leaves - control-Ms are not the problem.
@Javiator If you didn't make a mistake copy/pasting the script and your real input does look like the example you provided then my best guess is either a) that's not a blank after div or, more likely b) you're using an awk that doesn't understand character classes. Try changing /^[[:space:]]+<div class to /^[ \t]+<div[ \t]+class.
@EdMorton, by changing /^[[:space:]]+<div class to /^[ \t]+<div[ \t]+class I now get the desired output! Thank you!

The result of a quick google for "xmlstarlet print div contents" and then a few seconds of trial and error:

$ xmlstarlet sel -t -m "//*[@class='product-price']" -v "." -n file
100,56
200,56
300,56
400,56

For an explanation - ask google :-).
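As a rough guide to the command above: `-t` starts a template, `-m` matches an XPath expression, `-v "."` prints the value of each matched node, and `-n` emits a newline. The same XPath can be tried in a short Python sketch with the standard library's ElementTree, assuming well-formed input. The `doc` string below is a hypothetical, trimmed excerpt of the question's markup, not the full file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed excerpt of the question's markup.
doc = """<body>
<div class="product-cell">
  <div class="product-title">Product A</div>
  <div class="product-price">100,56</div>
</div>
<div class="product-cell">
  <div class="product-title">Product B</div>
  <div class="product-price">200,56</div>
</div>
</body>"""

root = ET.fromstring(doc)

# ElementTree supports a subset of XPath; this mirrors //*[@class='product-price'].
prices = [el.text for el in root.findall(".//*[@class='product-price']")]

print("\n".join(prices))
```

Like xmlstarlet, ElementTree is an XML parser, so it raises an error on markup that is not well-formed.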

2 Comments

I just installed xmlstarlet and tried to test it, but unfortunately the server gives me HTML that is not well-formed. But I'll still upvote your answer!
That's far more likely to be a problem for an awk script than an XML-aware tool. That's WHY you should use an XML-aware tool.

If someone is looking for a Python solution, I would suggest using Python's BeautifulSoup library. The following was written and tested in Python 3.8. I am posting it as a separate answer to keep it apart from my previous one.

#!/usr/bin/env python3
## Import the library.
from bs4 import BeautifulSoup

## Read the whole Input_file; the with block closes the file automatically.
with open('Input_file', 'r') as f:
    contents = f.read()

## Parse the contents with the lxml parser.
soup = BeautifulSoup(contents, 'lxml')

## Select only the div elements with class product-price.
vals = soup.find_all("div", {"class": "product-price"})

## Print the text inside each matching tag.
for val in vals:
    print(val.text)

NOTE:

  • You need BeautifulSoup and lxml installed; install them with pip3 or pip, depending on your system.
  • Input_file is the file from which the program reads all your data.
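If installing packages is not an option, a similar result can be sketched with the standard library's `html.parser`, which does not require well-formed input. This is a minimal sketch, not part of the answer above: the `PriceCollector` class and the `html` sample are hypothetical, and it assumes each price div contains only text, with no nested tags.

```python
from html.parser import HTMLParser


class PriceCollector(HTMLParser):
    """Collects the text of every <div> whose class list contains product-price."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # A class attribute may hold several names, or be missing entirely.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "product-price" in classes:
            self.in_price = True

    def handle_endtag(self, tag):
        # Assumption: the price div has no nested divs, so any </div> ends it.
        if tag == "div":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(data.strip())


# Hypothetical sample; the last product-cell div is left unclosed on purpose.
html = """
<div class="product-cell">
  <div class="product-price">100,56</div>
</div>
<div class="product-cell">
  <div class="product-price">200,56</div>
"""

parser = PriceCollector()
parser.feed(html)
parser.close()
print(parser.prices)
```

The parser reports tags as it meets them rather than building a tree, so the unclosed div in the sample does not stop it from collecting both prices.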


How would you do this?

If possible, use a tool designed for dealing with HTML, which GNU AWK is not.

If you are allowed to install software, use hxselect. It reads standard input and understands a subset of CSS selectors, so in this case something like:

hxselect -i -c -s '\n' div.product-price < file.html

should give you the desired result (disclaimer: I am not able to test it).


It baffles me that time and time again people try to parse HTML not with an HTML parser, but with tools that don't understand HTML at all, and with regular expressions in particular!
With an HTML parser like xidel it's as simple as:

$ xidel -s "<url> or input.html" -e '//div[@class="product-price"]'
