I'm trying to parse an HTML page with pup, a command-line HTML parser that accepts CSS selectors. I know I could use Python, which I have installed on my machine, but I'd like to learn pup to get more practice with the command line.
The website I want to scrape from is https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/topic-pages/tables/table-1
I saved the page to an HTML file:
curl https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/topic-pages/tables/table-1 > fbi2018.html
How do I extract a single column of data, such as 'Population'?
This is the command I originally wrote:
cat fbi2018.html | grep -A1 'cell31 ' | grep -v 'cell31 ' | sed 's/text-align: right;//' | sed 's/<[/]td>//' | sed 's/--//' | sed '/^[[:space:]]*$/d' | sort -nk1,1
It works, but it's an ugly, hacky way to do it, which is why I want to use pup. I noticed that every value in the 'Population' column has headers="cell31 .." somewhere in its <td> tag. For example:
<td id="cell211" class="odd group1 valignmentbottom numbercell" rowspan="1" colspan="1" headers="cell31 cell210">
323,405,935</td>
I want to extract every value whose <td> tag has this particular header, which in this example would be 323,405,935.
However, combining multiple selectors in pup doesn't seem to work. So far, I can select all the td elements:
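To make the structure I'm matching on concrete, here is a minimal sketch that reproduces it on a tiny sample rather than the full page. The file name sample.html, the function name extract_population, and the first (non-matching) cell are made up for illustration; only the second cell is copied from the real page. It assumes, as in the example above, that the value sits on the line right after the opening <td> tag.

```shell
# Hypothetical sample: the first cell is invented, the second is the one quoted above.
cat > sample.html <<'EOF'
<td id="cell212" class="odd group1 valignmentbottom numbercell" rowspan="1" colspan="1" headers="cell32 cell210">
1,250,162</td>
<td id="cell211" class="odd group1 valignmentbottom numbercell" rowspan="1" colspan="1" headers="cell31 cell210">
323,405,935</td>
EOF

# Print the line that follows each <td ...> whose headers attribute
# starts with cell31, with the closing </td> stripped off.
extract_population() {
  awk '
    grab { sub(/<\/td>.*/, ""); print; grab = 0 }
    /headers="cell31[ "]/ { grab = 1 }
  ' "$1"
}

extract_population sample.html
```

This is still line-oriented text matching, so it has the same fragility as my grep pipeline; it only shows which attribute identifies the column.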
cat fbi2018.html | pup 'td'
But I don't know how to select only the td elements whose headers attribute contains a particular value.
EDIT: The output should be:
272,690,813
281,421,906
285,317,559
287,973,924
290,788,976
293,656,842
296,507,061
299,398,484
301,621,157
304,059,724
307,006,550
309,330,219
311,587,816
313,873,685
316,497,531
318,907,401
320,896,618
323,405,935
325,147,121
327,167,434
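For what it's worth, the pup README lists CSS attribute selectors such as [attribute~="value"] and a text{} display function, so a selector along these lines may be what I'm after. This is an unverified sketch, not a confirmed solution: the function name population_with_pup is made up, and I'm assuming pup is installed and that headers is a space-separated list containing cell31 for every Population cell.

```shell
# Assumes pup is installed; the function name is hypothetical.
# td[headers~="cell31"] should match td elements whose space-separated
# headers attribute contains the word cell31; text{} prints only their text.
population_with_pup() {
  pup 'td[headers~="cell31"] text{}' < "$1"
}
```

It would be called as population_with_pup fbi2018.html. If the output contains blank or indented lines, it may still need the sed '/^[[:space:]]*$/d' step from my original command.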


grep -A1 'cell31 ' fbi2018.html