
I am extracting image data from Flickr via their API, and what I get printed is a few thousand XML elements that look like this:

<photo accuracy="15" context="0" dateupload="1398279194" farm="8" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="320" id="13986079375" isfamily="0" isfriend="0" ispublic="1" latitude="41.828482" license="0" longitude="-87.624506" owner="100231432@N02" pathalias="perspectivesschools" place_id="cF8n.mJTWrhYf0uBEw" secret="f46eef0b1d" server="7308" title="Sean Gallagher, Pulitzer Photojournalist visits MSA" url_n="https://farm8.staticflickr.com/7308/13986079375_f46eef0b1d_n.jpg" width_n="213" woeid="28297331" />
<photo accuracy="12" context="0" dateupload="1394558054" farm="4" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="213" id="13086071753" isfamily="0" isfriend="0" ispublic="1" latitude="51.451914" license="2" longitude="-0.122882" owner="96189004@N04" pathalias="" place_id="JYdWRftQUbMvFA" secret="265103ac38" server="3040" title="" url_n="https://farm4.staticflickr.com/3040/13086071753_265103ac38_n.jpg" width_n="320" woeid="13978" />
<photo accuracy="12" context="0" dateupload="1394558019" farm="8" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="213" id="13086343854" isfamily="0" isfriend="0" ispublic="1" latitude="51.451914" license="2" longitude="-0.122882" owner="96189004@N04" pathalias="" place_id="JYdWRftQUbMvFA" secret="a6858f84d2" server="7451" title="" url_n="https://farm8.staticflickr.com/7451/13086343854_a6858f84d2_n.jpg" width_n="320" woeid="13978" />

Now I want to extract the data for the 'latitude' and 'longitude' attributes in one run, and the data for the 'url_n' attribute in another. How can I do that in Python? I have no experience with parsing XML data and don't know where to start.

Thanks a lot!

1 Comment

  • Look at etree or BeautifulSoup. Commented Jul 20, 2014 at 14:45

2 Answers


Use lxml

While there are multiple XML-related packages in Python, including one in the standard library, I prefer using lxml, as it offers everything I need (good XPath support, schema validation, etc.) and I prefer to keep the number of packages I use small.

For the XML documents from Flickr, the solution could look like this:

Script flickr.py

from lxml import etree
xmllines = """
<photo accuracy="15" context="0" dateupload="1398279194" farm="8" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="320" id="13986079375" isfamily="0" isfriend="0" ispublic="1" latitude="41.828482" license="0" longitude="-87.624506" owner="100231432@N02" pathalias="perspectivesschools" place_id="cF8n.mJTWrhYf0uBEw" secret="f46eef0b1d" server="7308" title="Sean Gallagher, Pulitzer Photojournalist visits MSA" url_n="https://farm8.staticflickr.com/7308/13986079375_f46eef0b1d_n.jpg" width_n="213" woeid="28297331" />
<photo accuracy="12" context="0" dateupload="1394558054" farm="4" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="213" id="13086071753" isfamily="0" isfriend="0" ispublic="1" latitude="51.451914" license="2" longitude="-0.122882" owner="96189004@N04" pathalias="" place_id="JYdWRftQUbMvFA" secret="265103ac38" server="3040" title="" url_n="https://farm4.staticflickr.com/3040/13086071753_265103ac38_n.jpg" width_n="320" woeid="13978" />
<photo accuracy="12" context="0" dateupload="1394558019" farm="8" geo_is_contact="0" geo_is_family="0" geo_is_friend="0" geo_is_public="1" height_n="213" id="13086343854" isfamily="0" isfriend="0" ispublic="1" latitude="51.451914" license="2" longitude="-0.122882" owner="96189004@N04" pathalias="" place_id="JYdWRftQUbMvFA" secret="a6858f84d2" server="7451" title="" url_n="https://farm8.staticflickr.com/7451/13086343854_a6858f84d2_n.jpg" width_n="320" woeid="13978" />
"""

for line in xmllines.strip().splitlines():
    # each line holds one self-contained <photo ... /> element
    doc = etree.fromstring(line)
    # xpath() returns a list of matching attribute values (possibly empty)
    urls = doc.xpath("/photo/@url_n")
    if urls:
        url = urls[0]
        print(url)
    else:
        print("---no attribute url_n was found---")

which would output:

$ python flickr.py
https://farm8.staticflickr.com/7308/13986079375_f46eef0b1d_n.jpg
https://farm4.staticflickr.com/3040/13086071753_265103ac38_n.jpg
https://farm8.staticflickr.com/7451/13086343854_a6858f84d2_n.jpg
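
The latitude and longitude can be pulled the same way in a separate run. A minimal sketch of that, using the same one-element-per-line approach as flickr.py above; a single shortened sample line stands in for the full xmllines string so the sketch stays self-contained:

from lxml import etree

# Same approach as flickr.py above, but extracting latitude/longitude
# instead of url_n. One shortened sample line keeps the sketch self-contained.
xmllines = """
<photo id="13986079375" latitude="41.828482" longitude="-87.624506" url_n="https://farm8.staticflickr.com/7308/13986079375_f46eef0b1d_n.jpg" />
"""

for line in xmllines.strip().splitlines():
    doc = etree.fromstring(line)
    lats = doc.xpath("/photo/@latitude")
    lons = doc.xpath("/photo/@longitude")
    if lats and lons:
        print(lats[0], lons[0])
    else:
        print("---no latitude/longitude attributes were found---")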

5 Comments

It works, thanks! I managed to solve the problem with minidom, but the code breaks for some reason when parsing a larger dataset...
@loop_digga You are welcome. I guess the code that breaks is the minidom one, not the lxml one. As minidom does all the work in memory, it is possible to run into trouble with larger documents (even though your question shows a lot of very small XML documents). With lxml, there are a few options for processing even large (even endless) documents using limited memory, e.g. with iterparse or with SAX parsing; a minimal sketch follows these comments.
Thanks for the additional explanation! But actually, both scripts break on the larger dataset. I tried yours now with bigger data, and after printing 648 URLs this error occurs: 'IndexError: list index out of range'. Not sure what is happening.
@loop_digga It is likely that <photo ../> XML document number 649 has no "url_n" attribute. I have modified the answer to handle that.
You are right! And I obviously had the same problem in the minidom code. Thanks!
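
Since the comments above mention iterparse for large documents, here is a minimal streaming sketch. The file name "flickr.xml" and the assumption that the saved response has a single wrapping root element around the <photo/> elements are illustrative only, not taken from the question:

from lxml import etree

# Stream a large XML file instead of loading everything into memory at once.
# "flickr.xml" and the wrapping root element are assumptions for this sketch.
for event, elem in etree.iterparse("flickr.xml", tag="photo"):
    url = elem.get("url_n")   # None if the attribute is missing
    if url is not None:
        print(url)
    elem.clear()              # release the element once it has been processed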

Parsing XML with regex is not a good idea. Try BeautifulSoup: it not only parses XML, but also has functions to easily get the next/parent/etc. element relative to a selected one, along with its attributes.

Example use:

from bs4 import BeautifulSoup
(...)
soup = BeautifulSoup(flickr_xml, 'xml')  # the 'xml' parser requires lxml to be installed
for photo in soup.find_all('photo'):
    print(photo.get('url_n'))
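
A self-contained version of the above for reference, using one sample element from the question; wrapping it in a <photos> root element here is an assumption (roughly how the full Flickr response groups its <photo/> elements):

from bs4 import BeautifulSoup

# One sample element from the question, wrapped in an assumed root element.
flickr_xml = """<photos>
  <photo id="13986079375" latitude="41.828482" longitude="-87.624506"
         url_n="https://farm8.staticflickr.com/7308/13986079375_f46eef0b1d_n.jpg" />
</photos>"""

soup = BeautifulSoup(flickr_xml, 'xml')   # the 'xml' parser requires lxml
for photo in soup.find_all('photo'):
    print(photo.get('url_n'))
    print(photo.get('latitude'), photo.get('longitude'))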
