Extracting XML nodes in Python

Question

This is part of an XML document that I have:

<tr><td>Image:</td><td>
<a href="http://live.astrometry.net/status.php?job=alpha-201104-6758393&amp;get=fullsize.png">fullsize.png</a></td></tr>

I need to extract the href attribute "of the <a> element after the <td> element with content 'Image:'". The <a> element has no other id or class attributes that I can use.

Sorry if that sounds complicated.

Thanks in advance!

Do you need help with the syntax of a specific parser, or are you trying to find an XML parser for Python?
– Anthony, Commented Apr 7, 2011 at 21:29

Guard · Accepted Answer · 2011-04-07 22:12:52Z

2

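OK, the final, elegant (I hope ;) answer, with a single XPath expression:

from lxml import etree
root = etree.fromstring(your_text)  # your_text holds the XML shown in the question
# <td> whose text contains 'Image' -> its following <td> siblings -> their <a> children -> href
print(root.xpath("//td[contains(text(), 'Image')]/following-sibling::td/a/@href")[0])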

answered Apr 7, 2011 at 22:12

Guard


PEdroArthur · Accepted Answer · 2011-04-07 21:56:44Z

1

If your input file is just like your excerpt, the following code may help you:

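from xml.dom.minidom import parseString

def tdlinks(xml):
    """Return the href of the first <a> in each <td> that follows an 'Image:' <td>."""
    hrefs = []
    cells = parseString(xml).getElementsByTagName('td')
    # Walk consecutive pairs of cells in document order: (label cell, next cell).
    for label, value in zip(cells, cells[1:]):
        first = label.firstChild
        # Skip cells that are empty or do not start with a text node.
        if first is None or first.nodeType != first.TEXT_NODE:
            continue
        if first.wholeText == 'Image:':
            anchors = value.getElementsByTagName('a')
            if anchors:
                hrefs.append(anchors[0].getAttribute('href'))
    return hrefs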

Take a look at the minidom documentation; it may help you adapt the code if it misbehaves on your real input.

answered Apr 7, 2011 at 21:56

PEdroArthur


Guard · Accepted Answer · 2011-04-07 21:29:03Z

0

Use lxml: http://lxml.de/xpathxslt.html

Your XPath would look like /tr/td[2]/a to get the element (XPath positions are 1-based, so td[2] is the second cell, the one holding the link), then you can do el.attrib['href']

You can actually traverse the tree without XPath, but XPath is a very powerful and useful tool.
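
A minimal sketch of the positional approach described above. It uses the standard library's xml.etree.ElementTree rather than lxml, which is an assumption made here so the example runs without extra packages; ElementTree supports only a subset of XPath, but positional predicates and el.attrib behave the same way. XPath positions are 1-based, so the cell holding the link is td[2]. SAMPLE is the excerpt from the question.

```python
import xml.etree.ElementTree as ET

# The excerpt from the question, used as a complete document with <tr> as root.
SAMPLE = """<tr><td>Image:</td><td>
<a href="http://live.astrometry.net/status.php?job=alpha-201104-6758393&amp;get=fullsize.png">fullsize.png</a></td></tr>"""

root = ET.fromstring(SAMPLE)   # root is the <tr> element

# Paths are relative to root here: second <td> child, then its <a> child.
link = root.find("td[2]/a")
href = link.attrib["href"]     # the parser has already decoded &amp; to &
print(href)

# The first cell only holds the label, so there is no <a> under td[1].
print(root.find("td[1]/a"))
```

This relies on the link always sitting in the second cell; it never checks for the 'Image:' label, which is the gap the comments below address.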

answered Apr 7, 2011 at 21:29

Guard


3 Comments

Anthony Over a year ago

Based on his question, he would need to query the td after the td containing "Image:". I think there is a "next sibling" selector for XPath, right?

Guard Over a year ago

Yes, for sure (though I don't remember it :). Maybe, if it is this complicated (there are multiple td's and only one contains the 'Image:' text), the solution would be: select all td's, iterate, check whether each one's text starts with 'Image' (text().startswith('Image')), then select its next sibling.

Guard Over a year ago

the selector is following-sibling (w3schools.com/XPath/xpath_axes.asp)
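
The iterate-and-check idea from these comments could look roughly like the sketch below. It is written against the standard library's xml.etree.ElementTree, an assumption made so that it runs without lxml; ElementTree has no following-sibling axis, so the sketch pairs each td with the next td under the same parent instead. The function name image_hrefs and its prefix parameter are hypothetical, introduced only for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<tr><td>Image:</td><td>
<a href="http://live.astrometry.net/status.php?job=alpha-201104-6758393&amp;get=fullsize.png">fullsize.png</a></td></tr>"""

def image_hrefs(xml_text, prefix="Image"):
    """Collect hrefs from the <td> that directly follows a <td> whose text starts with prefix."""
    root = ET.fromstring(xml_text)
    hrefs = []
    # root.iter() yields root itself too, so a bare <tr> document is handled.
    for parent in root.iter():
        cells = [child for child in parent if child.tag == "td"]
        # Pair every cell with the next <td> under the same parent.
        for label_cell, next_cell in zip(cells, cells[1:]):
            if (label_cell.text or "").strip().startswith(prefix):
                hrefs.extend(
                    a.get("href") for a in next_cell.iter("a") if a.get("href")
                )
    return hrefs

print(image_hrefs(SAMPLE))
```

Unlike following-sibling::td, which selects every later td sibling, this sketch looks only at the immediately following cell.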

dting · Accepted Answer · 2011-04-07 22:41:32Z

0

from xml.dom import minidom

dom = minidom.parseString("""<tr><td>Image:</td><td>
<a href="http://live.astrometry.net/status.php?job=alpha-201104-6758393&amp;get=fullsize.png">fullsize.png</a></td></tr>
""")

print(dom.toxml() + "\n")

# Keep each <a> whose parent <td> directly follows a <td> whose text is 'Image:'.
links = (a.attributes['href'].value for a in dom.getElementsByTagName('a')
    if a.parentNode.nodeName == 'td' and a.parentNode.previousSibling.firstChild.data == 'Image:')

for link in links:
    print(link)

results in:

<?xml version="1.0" ?><tr><td>Image:</td><td>
<a href="http://live.astrometry.net/status.php?job=alpha-201104-6758393&amp;get=fullsize.png">fullsize.png</a></td></tr>

http://live.astrometry.net/status.php?job=alpha-201104-6758393&get=fullsize.png
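
One caveat with previousSibling: minidom keeps whitespace between tags as text nodes, so if the real document is indented, the previous sibling of the link's td is a whitespace text node rather than the 'Image:' cell. A hedged sketch of a more tolerant variant follows; the helpers previous_element and cell_text and the indented PRETTY sample are hypothetical, added only for illustration.

```python
from xml.dom import minidom

def previous_element(node):
    """Return the nearest preceding sibling that is an element, or None."""
    sibling = node.previousSibling
    while sibling is not None and sibling.nodeType != sibling.ELEMENT_NODE:
        sibling = sibling.previousSibling
    return sibling

def cell_text(cell):
    """Concatenate the direct text children of a cell and trim whitespace."""
    return "".join(
        child.data for child in cell.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()

# Same content as the question's excerpt, but indented (an assumed layout).
PRETTY = """<tr>
  <td>Image:</td>
  <td>
    <a href="http://live.astrometry.net/status.php?job=alpha-201104-6758393&amp;get=fullsize.png">fullsize.png</a>
  </td>
</tr>"""

dom = minidom.parseString(PRETTY)
links = []
for a in dom.getElementsByTagName("a"):
    cell = a.parentNode
    label = previous_element(cell) if cell.nodeName == "td" else None
    if label is not None and label.nodeName == "td" and cell_text(label) == "Image:":
        links.append(a.getAttribute("href"))

print(links)
```

Skipping non-element siblings and trimming the label text makes the lookup independent of how the document happens to be indented.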

answered Apr 7, 2011 at 22:41

dting
