I am trying to read through an HTML document using Python and gather all of the table rows into a single list. (I am aware of specialized tools for this purpose, but I must use regex.) Here is my code so far:
import urllib
import re
URL = 'http://www.xpn.org/events/concert-calendar'
sock = urllib.urlopen( URL )
doc = sock.read()
sock.close()
patString = r'''
< tr(. * ?)>
(.*?)
< /tr>
'''
pattern = re.compile(patString, re.VERBOSE)
concerts = re.findall(pattern, doc)
print (concerts)
However, this only prints an empty list. I have tried a few different patterns, but all have produced the same result. I'm pretty sure that the issue is the pattern, but I'm not entirely sure (I am still getting acquainted with Python while writing this). The table rows I am trying to find have the format <tr class="odd/even"> other data </tr>, and I would like to capture all of this data and place it into a list for use later in the script.
Any help is appreciated. Thanks
"I must use regex." Really? I am curious as to why.
. * ? literally parses any character, followed by an unlimited amount of space, followed by possibly one more space. Is that what you meant to type?
Try pattern = re.compile(patString, re.VERBOSE|re.DOTALL).
Inside the tag, avoid .*?, otherwise the matching could escape from the <> and match more than one tag at a time.
.*? is valid and correct. The spaces are acceptable and do not cause a problem here. They are ignored because the pattern is compiled with the re.VERBOSE option.
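The two suggestions from the comments (adding re.DOTALL, and keeping the attribute match inside the <>) can be tried out on a small sample without fetching the page. This is only a sketch: sample_html, row_pattern and no_dotall are made-up names, the sample rows are invented stand-ins for the real page, and [^>]* is one possible way to keep the match inside the tag, not necessarily what the commenter had in mind. The pattern is written without spaces inside the *? quantifier, since the Python documentation says whitespace inside such tokens is not ignored under re.VERBOSE.

```python
import re

# Invented sample standing in for the downloaded page (an assumption, not
# the real site content). Each row spans several lines, as is common in HTML.
sample_html = """<table>
<tr class="odd">
  <td>Band A</td>
</tr>
<tr class="even">
  <td>Band B</td>
</tr>
</table>"""

# DOTALL lets "." match newlines, so a row body can span multiple lines.
row_pattern = re.compile(r'''
    <tr\b([^>]*)>   # opening tag; [^>]* cannot run past the closing ">"
    (.*?)           # row body, matched as briefly as possible
    </tr>           # closing tag
''', re.VERBOSE | re.DOTALL)

# Same pattern without DOTALL: "." stops at every newline.
no_dotall = re.compile(r'<tr\b([^>]*)>(.*?)</tr>')

rows = row_pattern.findall(sample_html)

print(no_dotall.findall(sample_html))  # multi-line rows are not matched
for attrs, body in rows:
    print("%s | %s" % (attrs.strip(), body.strip()))
```

findall returns one (attributes, body) tuple per row because the pattern has two groups. On the sample, the version without re.DOTALL finds nothing, which is consistent with the empty list described in the question if the rows on the real page span more than one line.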