HTML parsing with pup

Question

I'm trying to parse an HTML page with pup, a command-line HTML parser that accepts CSS selectors. I know I could use Python, which is installed on my machine, but I'd like to learn pup to get practice with the command line.

The website I want to scrape from is https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/topic-pages/tables/table-1

I created an html file:

curl https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/topic-pages/tables/table-1 > fbi2018.html

How do I extract a column of data, such as 'Population'?

This is the command I originally wrote:

cat fbi2018.html | grep -A1 'cell31 ' | grep -v 'cell31 ' | sed 's/text-align: right;//' | sed 's/<[/]td>//' | sed 's/--//' | sed '/^[[:space:]]*$/d' | sort -nk1,1

It works, but it's an ugly, hacky way to do it, which is why I want to use pup. I noticed that all of the values I need from the 'Population' column have headers="cell31 ..." somewhere within the <td> tag. For example:

<td id="cell211" class="odd group1 valignmentbottom numbercell" rowspan="1" colspan="1" headers="cell31 cell210">
323,405,935</td>

I want to extract all the values that have this particular header in their <td> tag, which in this example would be 323,405,935.

It seems that multiple selectors don't work in pup, however. So far, I can select all the td elements:

cat fbi2018.html | pup 'td'

But I don't know how to select cells whose headers attribute contains a particular value.

EDIT: The output should be:

272,690,813
281,421,906
285,317,559
287,973,924
290,788,976
293,656,842
296,507,061
299,398,484
301,621,157
304,059,724
307,006,550
309,330,219
311,587,816
313,873,685
316,497,531
318,907,401
320,896,618
323,405,935
325,147,121
327,167,434

Side note: you can just grep the file directly without piping it from cat: grep -A1 'cell31 ' fbi2018.html
– annahri, Commented May 30, 2020 at 5:35

annahri · Accepted Answer · 2021-07-16 07:50:35Z

7

TLDR

Use this if you want the whole column under 'Population' of that table:

... | pup 'div#table-data-container:nth-of-type(3) td.group1 text{}'

Basic usage

pup does support multiple selectors. For example, suppose you want to grab the text 'wanted text!!' below:

$ cat file.html
<div>
  <table>
    <tr class='class-a'>
       <td id='aaa'> some text </td>
       <td id='bbb'> some other text. </td>
    </tr>
    <tr class='class-b'>
       <td id='aaa'> wanted text!! </td>
       <td id='bbb'> some other text. </td>
    </tr>
  </table>
</div>

$ cat file.html | pup 'div table tr.class-b td#aaa'
<td id="aaa">
 wanted text!!
</td>

Then add text{} to get only the text:

$ cat file.html | pup 'div table tr.class-b td#aaa text{}'
 wanted text!!

So in your case it should be:

$ cat fbi2018.html | pup 'td#cell211 text{}'

323,405,935

Or better, you don't have to download the page; just pipe curl to pup:

url="https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/topic-pages/tables/table-1"
curl -s "$url" | pup 'td#cell211 text{}'
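
The headers attribute noticed in the question can also be targeted directly. pup's README lists the CSS attribute selectors [attribute*="value"] and [attribute~="value"], so a sketch like the one below might select a column by its header id. The sample HTML is a hypothetical fragment modeled on the <td> quoted in the question (the second cell and its value are made up), and population_cells is an illustrative helper name. Whether every row of the live page follows the same headers pattern is an assumption.

```shell
# Hypothetical sample modeled on the <td> quoted in the question.
# The second cell (cell212, value 0) is invented for illustration only.
sample_html='<table>
<tr><th id="cell31">Population</th><th id="cell32">Other</th></tr>
<tr><td id="cell211" headers="cell31 cell210">
323,405,935</td><td id="cell212" headers="cell32 cell210">
0</td></tr>
</table>'

# [headers~="cell31"] matches when cell31 is one of the space-separated
# words in the headers attribute. sed then drops whitespace-only lines
# and trims leading whitespace from the remaining ones.
population_cells() {
  pup 'td[headers~="cell31"] text{}' |
    sed -e '/^[[:space:]]*$/d' -e 's/^[[:space:]]*//'
}

# Only run the demo when pup is actually installed.
if command -v pup >/dev/null 2>&1; then
  printf '%s\n' "$sample_html" | population_cells
else
  echo "pup is not installed; skipping demo" >&2
fi
```

The word-match form ~= is used rather than the substring form *= because [headers*="cell31"] would also match ids such as cell310.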

Explanation

If you want the values from an entire column, you need to know what distinguishes the elements you want to scrape.

In this case, that is the 'Population' column on the given page. On the page, there are two tables wrapped in <div id='table-data-container'>. If you use ... | pup 'div#table-data-container', it will also grab data from the second table. You don't want that.

How does pup know you want the first table? There's another hint: there are a few <div>s, and your table is in the third one. So you can use a CSS pseudo-class, in this case div#table-data-container:nth-of-type(3).

Then, the cells in that column can be selected with td.group1.

Combine them all, then pipe the result to grep -v -e '^$' to get rid of blank lines.

... | pup 'div#table-data-container:nth-of-type(3) td.group1 text{}' | grep -v -e '^$'

and you will get what you wanted:

272,690,813
281,421,906
285,317,559
...
327,167,434
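
Note that grep -v -e '^$' removes only lines that are completely empty, so a line holding just spaces, or a value with leading whitespace, would pass through unchanged. Below is a minimal sketch of a slightly stricter cleanup using sed with POSIX character classes. The raw variable is simulated input standing in for pup's text{} output (the values are taken from the expected output in the question), and clean_column is a hypothetical helper name.

```shell
# Simulated pup text{} output: values mixed with empty and
# whitespace-only lines, one value carrying a leading space.
raw=$(printf '\n272,690,813\n   \n 281,421,906\n\n327,167,434\n')

# Delete lines that are empty or whitespace-only, then trim
# leading and trailing whitespace from the lines that remain.
clean_column() {
  sed -e '/^[[:space:]]*$/d' \
      -e 's/^[[:space:]]*//' \
      -e 's/[[:space:]]*$//'
}

printf '%s\n' "$raw" | clean_column
```

To do arithmetic on the values afterwards, the thousands separators could be removed by appending tr -d ',' to the pipeline.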

edited Jul 16, 2021 at 7:50

answered May 30, 2020 at 5:41

annahri


Thanks! Now that I know how to extract a value out of one particular cell, how do I do that for the entire column under "Population"? Of course, I could do pup 'td#cell41 text{}', pup 'td#cell51 text{}', pup 'td#cell61 text{}', ... pup 'td#cell221 text{}', pup 'td#cell231 text{}', but I'd like to find a simpler way.
– rplee, Commented May 30, 2020 at 7:15

P.S. There are 20 cells in total under that column.
– rplee, Commented May 30, 2020 at 7:20

@rplee And kindly accept my answer if this does solve your problem.
– annahri, Commented May 30, 2020 at 7:56

Exactly what I was looking for, thanks. Answer accepted :)
– rplee, Commented May 30, 2020 at 13:07

OK, I played around with this a little more, and I was able to get the data from the other columns as well, just by incrementing the group number (i.e. ... | pup 'div#table-data-container:nth-of-type(3) td.group21 text{}' | grep -v -e '^$' for the 'Motor Vehicle Theft' column). However, how did you know the 'Population' column was in the 3rd division? I tried to get the data under the 'Year' column, but I couldn't. I can see that the 'Year' column is in group0, but it's not in the 3rd division, because nothing prints out when I use that same pup command for group0. How do I get the 'Year' column?
– rplee, Commented May 31, 2020 at 13:18

bat · Accepted Answer · 2020-05-29 18:00:09Z

0

There are two problems here:
1) Parse the values from an HTML table
2) Do your desired operations (min, max, etc.)

I don't think you will be able to do this in one line. I like the idea of converting the HTML table to a .csv and then operating on the CSV. You can use AWK for that but I'd use the Python library, Pandas, instead. Why write bash if you can avoid it?

I found a way to use bash to convert an HTML table to a .csv here

An example of using AWK to average a column is here

edited May 29, 2020 at 18:00

answered May 29, 2020 at 17:44

bat


That's why OP is using pup. Just like jq but for HTML.
– annahri, Commented May 30, 2020 at 6:05

The question has been changed a few times since I answered it. Originally pup and jq were in the question title but not explicitly required for the question. Since the question changed, I would have deleted my answer if I could have. Thanks for your answer. I learned something new.
– bat, Commented May 30, 2020 at 15:10
