Parse HTML Using AWK

Question

I have the following HTML structure and want to extract data from it using awk.

<body>
<div>...</div>
<div>...</div>
<div class="body-content">
    <div>...</div>
    <div class="product-list" class="container">
        <div class="w3-row" id="product-list-row">
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product A</div>
                    <div class="product-price">100,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product B</div>
                    <div class="product-price">200,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product C</div>
                    <div class="product-price">300,56</div>
                </div>
            </div>
            <div class="w3-col m2 s4">
                <div class="product-cell">
                    <div class="product-title">Product D</div>
                    <div class="product-price">400,56</div>
                </div>
            </div>
        </div>
    </div>
</div>
</body>

The result I want to have is as follows.

I was experimenting with the following awk script (I know it makes no sense to select product-price twice, I was about to modify this script)

awk -F '<[^>]+>' 'found { sub(/^[[:space:]]*/,";"); print title $0; found=0 } /<div class="product-price">/ { title=$2 } /<div class="product-price">/  { found=1 }'

but it gives me the result

100,56                </div>
200,56                </div>
300,56                </div>
400,56                </div>

I have never used awk before, so I can't figure out what is wrong here or how to modify the above code. How would you do this?

Can you use a tool that understands XML instead, e.g. xmlstarlet?
– Ed Morton, Commented Jun 27, 2021 at 17:34
Awk is a great tool for many sorts of text searching, but it is not well-suited for hierarchical structures like HTML. You'd be much better off with a tool designed for the job. @Ed Morton's suggestion xmlstarlet is a fine choice for use from the shell. Alternatively, if you know any scripting languages (e.g. Perl, Python, Ruby, JavaScript, ...), most of them have installable libraries for HTML parsing.
– Mark Reed, Commented Jun 27, 2021 at 17:55
Actually, GNU awk has an XML library too - see gawkextlib.sourceforge.net/xml/xml.html.
– Ed Morton, Commented Jun 27, 2021 at 17:57
@EdMorton true, though last I checked installing gawk add-ons was not as straightforward as using cpanm, pip, gem, npm, etc.
– Mark Reed, Commented Jun 28, 2021 at 4:37

RavinderSingh13 · Accepted Answer · 2021-06-27 18:04:10Z

3

With the samples and attempts you have shown, please try the following awk code.

awk -F"[><]" '{gsub(/\r/,"")} /^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{print $3}' Input_file

Explanation: a detailed, commented version of the above. It is for explanation only; to run the code, use the one-liner above.

awk -F"[><]" '      ## Start the awk program and set the field separator to < or >.
{gsub(/\r/,"")}     ## Remove carriage returns (control-M characters) from each line.
/^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{ ## Match lines that start with blanks,
                    ## followed by <div class="product-price"> up to the closing div tag.
  print $3          ## Print the 3rd field, which holds the price.
}
' Input_file        ## Name of the input file.

The regex was changed to /^[ \t]+<div[ \t]+class as per Ed's suggestion in the comments. Experts also always recommend using xmlstarlet or another XML-aware tool if one is available on your system.
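
To see why `$3` holds the price, the field splitting can be mimicked outside awk. The following is only an illustrative sketch in Python, not part of the answer's solution; `split_fields` is a hypothetical helper, and the sample line is copied from the question.

```python
import re

# Sample line copied from the question's HTML.
line = '                    <div class="product-price">100,56</div>'

def split_fields(text):
    """Split like awk -F"[><]": every '<' or '>' ends a field."""
    return re.split(r"[><]", text)

fields = split_fields(line)
# awk numbers fields from 1, Python from 0, so awk's $3 is fields[2].
for number, value in enumerate(fields, start=1):
    print(f"${number} = {value!r}")
```

The leading blanks form the first field and the opening tag the second, so the text between the tags lands in the third.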

edited Jun 27, 2021 at 18:04

answered Jun 27, 2021 at 17:34

RavinderSingh13


13 Comments

Said Savci Over a year ago

@RavinderSingh13, nice catch! The file contains control M characters.

Ed Morton Over a year ago

Control-Ms in the input would not cause Ravinder's original script to produce no output; it would work just fine either way, since it does nothing with the character at the end of each line.

Ed Morton Over a year ago

Reading the tea leaves - control-Ms are not the problem.

Ed Morton Over a year ago

@Javiator If you didn't make a mistake copy/pasting the script and your real input does look like the example you provided then my best guess is either a) that's not a blank after div or, more likely b) you're using an awk that doesn't understand character classes. Try changing /^[[:space:]]+<div class to /^[ \t]+<div[ \t]+class.

Said Savci Over a year ago

@EdMorton, by changing /^[[:space:]]+<div class to /^[ \t]+<div[ \t]+class I now get the desired output! Thank you!

Ed Morton · Accepted Answer · 2021-06-27 17:42:41Z

3

The result of a quick google for xmlstarlet print div contents and then a few secs of trial and error:

$ xmlstarlet sel -t -m "//*[@class='product-price']" -v "." -n file
100,56
200,56
300,56
400,56

For an explanation - ask google :-).
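
In brief, `sel -t` starts a selection template, `-m` iterates over the nodes matched by the XPath expression, `-v "."` prints each node's text value, and `-n` emits a newline. The same XPath idea can be sketched in Python's standard library. The input here is a trimmed, well-formed stand-in for the question's file, which is an assumption: like xmlstarlet, `xml.etree.ElementTree` rejects input that is not well-formed XML.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the question's input (an assumption:
# the real page must parse as XML for this approach to work).
SAMPLE = """\
<body>
  <div class="product-cell">
    <div class="product-title">Product A</div>
    <div class="product-price">100,56</div>
  </div>
  <div class="product-cell">
    <div class="product-title">Product B</div>
    <div class="product-price">200,56</div>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)
# Same idea as //*[@class='product-price']: any element, at any depth,
# whose class attribute is exactly "product-price".
prices = [node.text for node in root.findall(".//*[@class='product-price']")]
print("\n".join(prices))
```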

edited Jun 27, 2021 at 17:42

answered Jun 27, 2021 at 17:37

Ed Morton

2 Comments

Said Savci Over a year ago

I just installed xmlstarlet and tried to test it, but unfortunately the server gives me an HTML that is not well-formed. But I'll still upvote your answer!

Ed Morton Over a year ago

That's far more likely to be a problem for an awk script than an XML-aware tool. That's WHY you should use an XML-aware tool.
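
For input that is not well-formed, a lenient HTML parser is another option. The sketch below uses Python's standard-library `html.parser`, which handles tags as a stream and does not reject unclosed tags or duplicated attributes. `PriceCollector` is a hypothetical class and the malformed sample is invented for illustration; it is not the asker's real page.

```python
from html.parser import HTMLParser

class PriceCollector(HTMLParser):
    """Collect the text of every <div class="product-price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == "product-price":
            self.in_price = True
            self.prices.append("")

    def handle_endtag(self, tag):
        # Assumes a price div contains no nested div elements.
        if tag == "div":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices[-1] += data

# Invented sample: a duplicated attribute, an unclosed <br>, a missing </div>.
MALFORMED = """
<div class="product-list" class="container">
  <div class="product-cell"><br>
    <div class="product-price">100,56</div>
  <div class="product-cell">
    <div class="product-price">200,56</div>
  </div>
"""

collector = PriceCollector()
collector.feed(MALFORMED)
collector.close()
print(collector.prices)
```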

RavinderSingh13 · Accepted Answer · 2021-06-27 18:28:59Z

If someone is looking for a Python solution, I would suggest using Python's BeautifulSoup library. The following was written and tested in Python 3.8. To keep it separate from my previous answer, I am adding it as another answer here.

#!/usr/bin/env python3
## Import the library.
from bs4 import BeautifulSoup

## Read the whole of Input_file; the with block closes the file automatically.
with open('Input_file', 'r') as f:
    contents = f.read()

## Parse the contents with the lxml parser into the soup variable.
soup = BeautifulSoup(contents, 'lxml')
## Select only the div elements with class product-price, as the OP needs.
vals = soup.find_all("div", {"class": "product-price"})
## Print the text inside each matching tag.
for val in vals:
    print(val.text)

NOTE:

BeautifulSoup and lxml must both be installed, using pip3 or pip depending on your system.
Input_file is the file from which the program reads all your data.
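
If installing packages is not an option, a similar result can be sketched with the standard library alone. The question's own awk attempt joins a title and a price with `;`, so this sketch assumes that output shape; the desired output itself is not shown in the question. `ProductParser` is a hypothetical class and the sample is a shortened copy of the question's markup.

```python
from html.parser import HTMLParser

class ProductParser(HTMLParser):
    """Pair each product-title with the product-price that follows it."""

    FIELDS = {"product-title", "product-price"}

    def __init__(self):
        super().__init__()
        self.current = None   # class of the div whose text is being read
        self.title = ""
        self.rows = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        self.current = cls if tag == "div" and cls in self.FIELDS else None

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current == "product-title":
            self.title = data.strip()
        elif self.current == "product-price":
            self.rows.append((self.title, data.strip()))

SAMPLE = """
<div class="product-cell">
    <div class="product-title">Product A</div>
    <div class="product-price">100,56</div>
</div>
<div class="product-cell">
    <div class="product-title">Product B</div>
    <div class="product-price">200,56</div>
</div>
"""

parser = ProductParser()
parser.feed(SAMPLE)
parser.close()
for title, price in parser.rows:
    print(f"{title};{price}")
```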

Daweo · Accepted Answer · 2021-06-28 07:52:37Z

2

How would you do this?

If possible, use a tool designed for dealing with HTML, which GNU AWK is not.

If you are allowed to install software, use hxselect. It processes standard input and understands a subset of CSS selectors, so in this case something like:

hxselect -i -c -s '\n' div.product-price < file.html

should give you the desired result (disclaimer: I am not able to test it).
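
One difference from the XPath-based answers in this thread: the CSS selector `div.product-price` matches any `div` whose class list contains `product-price`, whereas `@class='product-price'` requires the whole attribute to equal that string. The sketch below illustrates the two matching rules in Python; both helper functions are hypothetical and exist only for this comparison.

```python
def matches_css_class(class_attr, name):
    """CSS .name semantics: name is one of the whitespace-separated tokens."""
    return name in (class_attr or "").split()

def matches_exact(class_attr, name):
    """XPath @class='name' semantics: the whole attribute equals name."""
    return class_attr == name

for value in ["product-price", "product-price sale", "old-product-price"]:
    print(f"{value!r}: css={matches_css_class(value, 'product-price')}, "
          f"exact={matches_exact(value, 'product-price')}")
```

For the question's sample the two rules agree, because every price `div` carries exactly one class.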

answered Jun 28, 2021 at 7:52

Daweo

Reino · Accepted Answer · 2021-07-02 12:34:53Z

2

It baffles me that time and time again people try to parse HTML, not with an HTML parser, but with a tool that doesn't understand HTML at all in general and with RegEx in particular!
With an HTML parser like xidel it's as simple as:

$ xidel -s "<url> or input.html" -e '//div[@class="product-price"]'

answered Jul 2, 2021 at 12:34

Reino
