I am trying to read through an HTML document using Python and gather all of the table rows into a single list. (I am aware of specialized tools for this purpose, but I must use regex.) Here is my code so far:

import urllib
import re
URL = 'http://www.xpn.org/events/concert-calendar'
sock = urllib.urlopen( URL )
doc = sock.read()
sock.close()
patString = r'''
    < tr(. * ?)>
    (.*?)
    < /tr>
    '''
pattern = re.compile(patString, re.VERBOSE)
concerts = re.findall(pattern, doc)
print (concerts)

However, this prints only an empty list. I have tried a few different patterns, but all have produced the same result. I'm pretty sure that the issue is the pattern, but I'm not entirely certain (I am trying to get acquainted with Python while writing this). The table rows I am trying to find have the format <tr class="odd/even"> other data </tr>, and I would like to capture all of this data and place it into a list for use later in the script.

Any help is appreciated. Thanks

  • "I must use regex." Really? I am curious as to why. Commented May 9, 2014 at 16:55
  • . * ? literally parses any character, followed by an unlimited amount of space, followed by possibly one more space. Is that what you meant to type? Commented May 9, 2014 at 16:59
  • Newlines in your data? Try pattern = re.compile(patString, re.VERBOSE|re.DOTALL). Commented May 9, 2014 at 17:01
  • You probably need .*?, otherwise the matching could escape from the <> and match more than one tag at a time. Commented May 9, 2014 at 17:27
  • @njzk2: Your latest comment is completely incorrect. .*? is valid and correct. The spaces are acceptable and do not cause a problem here. They are ignored because the pattern is compiled with the re.VERBOSE option. Commented May 9, 2014 at 18:45

1 Answer

The pattern below matches your sample data just fine. If the data spans multiple lines, turn on the option that lets . match \n as well. That option is re.DOTALL.

<tr(.*?)>(.*?)</tr>
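To make the effect of re.DOTALL concrete, here is a minimal sketch. The doc string is made-up sample markup standing in for the downloaded page (the real page content is not shown in the question), so treat the rows as an assumption.

```python
import re

# Hypothetical sample markup; the real page is assumed to look roughly like this.
doc = """<table>
<tr class="odd">
  <td>Band A</td>
</tr>
<tr class="even">
  <td>Band B</td>
</tr>
</table>"""

pattern = r'<tr(.*?)>(.*?)</tr>'

# Without DOTALL, "." stops at every newline, so no row can be matched.
without_dotall = re.findall(pattern, doc)

# With DOTALL, "." also matches "\n" and each row is captured.
with_dotall = re.findall(pattern, doc, re.DOTALL)

print(without_dotall)        # []
for attrs, body in with_dotall:
    print(repr(attrs), repr(body.strip()))
```

Each result is a tuple of (attributes, row contents), because the pattern has two capture groups.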

The ? qualifier (which makes the match non-greedy) on the data in the middle is pretty important; otherwise a single match could swallow several <tr></tr> blocks as the data part.
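A small sketch of the difference, using a made-up one-line string with two rows (the sample data is an assumption, not taken from the real page):

```python
import re

# Two hypothetical rows on a single line.
doc = '<tr class="odd">first</tr><tr class="even">second</tr>'

# Non-greedy body: stops at the first </tr> it can reach.
lazy = re.findall(r'<tr(.*?)>(.*?)</tr>', doc)

# Greedy body: runs to the last </tr>, swallowing the second row.
greedy = re.findall(r'<tr(.*?)>(.*)</tr>', doc)

print(lazy)    # [(' class="odd"', 'first'), (' class="even"', 'second')]
print(greedy)  # [(' class="odd"', 'first</tr><tr class="even">second')]
```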

It is easy because you are not parsing HTML, but instead just trying to extract some tags in a very specific case.

Things will get ugly if you have a <tr> nested inside another <tr>, for example.
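As an illustration of that failure mode, here is a sketch with a hypothetical nested table. The pattern cannot count opening and closing tags, so the outer row is cut off at the inner </tr> and the inner row never appears as its own match.

```python
import re

# Hypothetical markup: a row whose cell contains another table with its own row.
doc = ('<tr class="outer"><td><table>'
       '<tr class="inner"><td>x</td></tr>'
       '</table></td></tr>')

rows = re.findall(r'<tr(.*?)>(.*?)</tr>', doc, re.DOTALL)

print(len(rows))  # 1
print(rows[0])    # (' class="outer"', '<td><table><tr class="inner"><td>x</td>')
```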

4 Comments

I was going for re.MULTILINE, but re.DOTALL is much better in this case
This worked great. Thanks for the help. Accepting this answer.
Also, it is worth noting that I used <tr([^>] *?) > for the first tag.
@AaronC Excluding the > character from the match is a pretty good idea actually... a bit more explicit than the non-greedy match. The whole re.VERBOSE thing seems kind of weird to me, because everywhere else, pretty much always, white space matters in a regex. If you are learning the language for the first time, do yourself a favor and turn it off. [^>] * is truly difficult for my mind to accept as right, even though it works with the ignored white space.
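For comparison, a sketch of the negated-class variant written both ways: compact, and with re.VERBOSE so the ignored whitespace can be used for layout and comments. The sample row is made up. One caveat I am assuming here: [^>]* would stop early if an attribute value itself contained a > character.

```python
import re

# Hypothetical single row spanning several lines.
doc = '<tr class="odd">\n<td>Band A</td>\n</tr>'

# Compact form: every character in the pattern is significant.
compact = re.compile(r'<tr([^>]*)>(.*?)</tr>', re.DOTALL)

# Verbose form: whitespace outside character classes is ignored,
# and "#" starts a comment that runs to the end of the line.
verbose = re.compile(r'''
    <tr ([^>]*) >   # opening tag; group 1 holds the attributes
    (.*?)           # row contents, as little as possible
    </tr>           # closing tag
    ''', re.VERBOSE | re.DOTALL)

print(compact.findall(doc))  # [(' class="odd"', '\n<td>Band A</td>\n')]
print(compact.findall(doc) == verbose.findall(doc))  # True
```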
