Using regex in python for html tags

Question

I am trying to read through an html doc using python and gather all of the table rows into a single list. (I am aware of specialized tools for this purpose, but I must use regex.) Here is my code so far:

import urllib
import re
URL = 'http://www.xpn.org/events/concert-calendar'
sock = urllib.urlopen( URL )
doc = sock.read()
sock.close()
patString = r'''
    < tr(. * ?)>
    (.*?)
    < /tr>
    '''
pattern = re.compile(patString, re.VERBOSE)
concerts = re.findall(pattern, doc)
print (concerts)

However, this prints only an empty list. I have tried a few different patterns, but all have produced the same result. I'm fairly sure the pattern is the issue, but I'm not certain (I am still getting acquainted with Python while writing this). The table rows I am trying to find have the format <tr class="odd/even"> other data </tr>, and I would like to capture all of this data and place it in a list for use later in the script.

Any help is appreciated. Thanks

. * ? literally parses any character, followed by an unlimited amount of space, followed by possibly one more space. Is that what you meant to type?
– le3th4x0rbot, Commented May 9, 2014 at 16:59
Newlines in your data? Try pattern = re.compile(patString, re.VERBOSE|re.DOTALL).
– Steven Rumbalski, Commented May 9, 2014 at 17:01
You probably need .*?, otherwise the matching could escape from the <> and match more than one tag at a time.
– le3th4x0rbot, Commented May 9, 2014 at 17:27
@njzk2: Your latest comment is completely incorrect. .*? is valid and correct. The spaces are acceptable and do not cause a problem here. They are ignored because the pattern is compiled with the re.VERBOSE option.
– Steven Rumbalski, Commented May 9, 2014 at 18:45

le3th4x0rbot · Accepted Answer · 2014-05-09 17:39:21Z

3

The pattern below matches your sample data just fine. If a row runs over multiple lines, turn on the re.DOTALL option so that . also matches \n.

<tr(.*?)>(.*?)</tr>
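
As a rough illustration of the re.DOTALL point, here is a minimal sketch that runs the pattern above against a made-up two-row string. The sample markup and the variable names are assumptions for the example, not the real page content.

```python
import re

# Hypothetical sample markup: the first row spans several lines,
# the second row sits on a single line.
sample = (
    '<tr class="odd">\n'
    '  <td>Band A</td>\n'
    '</tr>\n'
    '<tr class="even"><td>Band B</td></tr>'
)

pattern = r'<tr(.*?)>(.*?)</tr>'

# Without re.DOTALL, "." does not match "\n", so the multi-line row is missed.
rows_default = re.findall(pattern, sample)

# With re.DOTALL, "." also matches "\n", so both rows are found.
rows_dotall = re.findall(pattern, sample, re.DOTALL)

print(len(rows_default))  # 1
print(len(rows_dotall))   # 2
```

Each result is a tuple of (attributes, row contents), because the pattern has two capture groups.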

The ? qualifier on the data group in the middle is pretty important: it makes that group non-greedy. Without it, the group could match entire <tr></tr> blocks as the data part.
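
To make the difference concrete, here is a small sketch comparing the non-greedy data group with a greedy one. The sample string is invented for the example.

```python
import re

# Hypothetical sample with two rows on one line.
sample = '<tr class="odd">first</tr><tr class="even">second</tr>'

# Non-greedy data group: each row is matched separately.
lazy = re.findall(r'<tr(.*?)>(.*?)</tr>', sample)

# Greedy data group: the match runs on to the last </tr>,
# so the second row ends up inside the first row's data.
greedy = re.findall(r'<tr(.*?)>(.*)</tr>', sample)

print(lazy)    # [(' class="odd"', 'first'), (' class="even"', 'second')]
print(greedy)  # [(' class="odd"', 'first</tr><tr class="even">second')]
```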

This is easy only because you are not really parsing HTML; you are just extracting some tags in a very specific case.

Things will get ugly if, for example, you have a <tr> nested inside another <tr>.
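
A quick sketch of the nested case, using an invented string with a table inside a table cell, shows what "ugly" looks like in practice:

```python
import re

# Hypothetical nested markup: an inner table row inside an outer row.
nested = (
    '<tr class="outer"><td><table>'
    '<tr class="inner">x</tr>'
    '</table></td></tr>'
)

matches = re.findall(r'<tr(.*?)>(.*?)</tr>', nested, re.DOTALL)

# The outer row is cut off at the inner </tr>, and the inner row
# is not reported as a row of its own.
print(matches)  # [(' class="outer"', '<td><table><tr class="inner">x')]
```

A regex has no notion of matching open and close tags, so it cannot pair the outer <tr> with its own </tr> here.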

edited May 9, 2014 at 17:39

answered May 9, 2014 at 17:31

le3th4x0rbot


4 Comments

njzk2 Over a year ago

I was going for re.MULTILINE, but re.DOTALL is much better in this case

Aaron C Over a year ago

This worked great. Thanks for the help. Accepting this answer.

Aaron C Over a year ago

Also, it is worth noting that I used <tr([^>] *?) > for the first tag.

le3th4x0rbot Over a year ago

@AaronC Excluding the > character from the match is a pretty good idea actually... a bit more explicit than the non-greedy match. The whole re.VERBOSE thing seems kind of weird to me, because everywhere else, pretty much always, white space matters in a regex. If you are learning the language for the first time, do yourself a favor and turn it off. [^>] * is truly difficult for my mind to accept as right, even though it works with the ignored white space.
