2

This is part of an XML document that I have:

<tr><td>Image:</td><td>
<a href="http://live.astrometry.net/status.php?job=alpha-201104-6758393&amp;get=fullsize.png">fullsize.png</a></td></tr>

I need to extract the href attribute "of the <a> element after the <td> element with content 'Image:'". The <a> element has no other id or class attributes that I can use.

Sorry if that sounds complicated.

Thanks in advance!

    Do you need help with the syntax of a specific parser, or are you trying to find an XML parser for Python? Commented Apr 7, 2011 at 21:29

4 Answers

2

OK, the final elegant (I hope ;) answer with a single XPath expression:

from lxml import etree
root = etree.fromstring(your_text)
# href of the <a> in a <td> that follows a <td> whose text contains 'Image'
print(root.xpath("//td[contains(text(), 'Image')]/following-sibling::td/a/@href")[0])

1

If your input file is just like your excerpt, the following code may help you:

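from xml.dom.minidom import parseString

def tdlinks(xml):
    """Return the href of the first <a> in each <td> that follows an 'Image:' <td>."""
    hrefs = []
    cells = parseString(xml).getElementsByTagName('td')
    # Pair each cell with the cell that follows it in document order.
    for label, cell in zip(cells, cells[1:]):
        first = label.firstChild
        # Skip empty cells and cells that do not start with a text node.
        if first is None or first.nodeType != first.TEXT_NODE or first.data != 'Image:':
            continue
        anchors = cell.getElementsByTagName('a')
        if anchors:
            hrefs.append(anchors[0].getAttribute('href'))
    return hrefs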

Take a look at the minidom documentation. It may help you to improve the code if you find any anomaly during its execution.

0

Use lxml: http://lxml.de/xpathxslt.html

For your excerpt, the XPath would look like /tr/td[2]/a to get the element (the link sits in the second cell); then you can read el.attrib['href'].

You can also traverse the tree without XPath, but XPath is a very powerful and useful tool.

3 Comments

Based on his question, he would need to query the td after the td containing "Image:". I think there is a "next sibling" selector for XPath, right?
Yes, for sure (though I don't remember it :). Maybe, if it is this complicated (there are multiple td's and only one contains the 'Image:' text), the solution would be: select all td's, iterate, check whether each one's text starts with 'Image', then select its next sibling.
The selector is following-sibling (w3schools.com/XPath/xpath_axes.asp)
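
The iterate-and-check approach from the comments can be sketched without XPath. This is only a rough illustration using the standard library's xml.etree.ElementTree instead of lxml; the function name hrefs_after_label and its label parameter are made up for this example, and it assumes the label text is the first thing inside its td.

```python
import xml.etree.ElementTree as ET

def hrefs_after_label(xml_text, label='Image'):
    """Collect hrefs of links in the cell right after a cell whose text starts with `label`."""
    root = ET.fromstring(xml_text)
    hrefs = []
    # iter('tr') also yields the root itself when the root is a <tr>.
    for row in root.iter('tr'):
        cells = row.findall('td')  # direct <td> children, in document order
        # Pair each cell with its next sibling cell.
        for cell, next_cell in zip(cells, cells[1:]):
            if (cell.text or '').strip().startswith(label):
                link = next_cell.find('.//a')
                if link is not None and link.get('href'):
                    hrefs.append(link.get('href'))
    return hrefs

# The excerpt from the question.
sample = """<tr><td>Image:</td><td>
<a href="http://live.astrometry.net/status.php?job=alpha-201104-6758393&amp;get=fullsize.png">fullsize.png</a></td></tr>"""

print(hrefs_after_label(sample))
```

The parser decodes &amp; in the attribute, so the returned URL contains a plain &.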
0

from xml.dom import minidom

dom = minidom.parseString("""<tr><td>Image:</td><td>
<a href="http://live.astrometry.net/status.php?job=alpha-201104-6758393&amp;get=fullsize.png">fullsize.png</a></td></tr>
""")

print(dom.toxml() + "\n")

links = (a.attributes['href'].value for a in dom.getElementsByTagName('a')
    if a.parentNode.nodeName == 'td' and a.parentNode.previousSibling.firstChild.data == 'Image:')

for link in links:
    print(link)

results in:

<?xml version="1.0" ?><tr><td>Image:</td><td>
<a href="http://live.astrometry.net/status.php?job=alpha-201104-6758393&amp;get=fullsize.png">fullsize.png</a></td></tr>

http://live.astrometry.net/status.php?job=alpha-201104-6758393&get=fullsize.png
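
One caveat with the previousSibling check: in minidom, previousSibling returns whatever node comes before the cell, and that is a whitespace text node whenever the markup has line breaks or indentation between the td elements. A possible, more tolerant variant is sketched below; previous_element, text_of and image_links are hypothetical helper names (not part of minidom), and the indented sample is invented for illustration.

```python
from xml.dom import minidom

def previous_element(node):
    """Hypothetical helper: nearest preceding sibling that is an element, or None."""
    sibling = node.previousSibling
    while sibling is not None and sibling.nodeType != sibling.ELEMENT_NODE:
        sibling = sibling.previousSibling
    return sibling

def text_of(node):
    """Hypothetical helper: stripped text of a node's direct text children."""
    return ''.join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE).strip()

def image_links(xml):
    """Return hrefs of <a> elements whose parent <td> follows an 'Image:' <td>."""
    dom = minidom.parseString(xml)
    links = []
    for a in dom.getElementsByTagName('a'):
        cell = a.parentNode
        if cell.nodeName != 'td':
            continue
        label = previous_element(cell)
        if label is not None and label.nodeName == 'td' and text_of(label) == 'Image:':
            links.append(a.getAttribute('href'))
    return links

# Invented sample with whitespace between the cells.
indented = """<tr>
  <td>Image:</td>
  <td><a href="http://example.com/a.png?x=1&amp;y=2">a.png</a></td>
</tr>"""

print(image_links(indented))
```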
