I have the following huge input file (from the Stack Exchange data dump):

 <row Id="659890" PostTypeId="2" ParentId="655986" CreationDate="2009-03-18T20:06:33.720" />
 <row Id="659891" PostTypeId="2" ParentId="659089" CreationDate="2009-03-18T20:07:44.843" /> 

Usually, the way I process a file is by reading line by line:

with open("file.txt", "r") as f:
    for line in f:
        print(line)

However, in this case I would like to process the file post by post. How can I do this?

Moreover, I want to extract the value of PostTypeId and save it in a variable (and do the same for the other attributes).

So my question is: what is the most efficient way to do this, given that the dataset can be really huge?

4 Comments
  • What have you tried so far? Commented Oct 15, 2014 at 20:35
  • Take a look at lxml. codereview.stackexchange.com/questions/2449/… Commented Oct 15, 2014 at 20:39
  • To add a note to this, parsing the data dump files (especially for Stack Overflow) can easily exceed system memory limits if not done appropriately. That should be an important consideration in any responses to this question. Commented Oct 15, 2014 at 20:46
  • I tried doing this manually: reading line by line and appending each line to a local string variable until a line ended with "/>". Then I extracted the values by scanning the string word by word and printing the content after the relevant attribute name (for example, after PostTypeId=" and before the next " character). After that I reset the string and repeated the process for the following lines. I know this is a rather crude and slow approach, but I think it should cope with big files because I read the file line by line (not 100% sure). Commented Oct 15, 2014 at 21:24

2 Answers

1

You can use xml.etree.ElementTree:

import xml.etree.ElementTree as ET

# source is the path to a well-formed XML file (or an open file object)
tree = ET.parse(source)
root = tree.getroot()
# Look at each element that has the 'row' tag
for row in root.iter('row'):
    print(row.get('PostTypeId'))

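EDIT: to handle the "junk after document element" error, wrap the data in a root element first:

with open(someFile, 'r') as data:
    # Wrap the rows in a single root element so the input is well-formed XML.
    # Note that data.read() loads the whole file into memory.
    xmlData = '<rows>' + data.read() + '</rows>'
rows = ET.fromstring(xmlData)
for row in rows:
    print(row.get('PostTypeId'))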
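
If the file is too large to read in one go, the same wrapping idea can be applied incrementally. The following is only a sketch, assuming Python 3.4+ (where xml.etree.ElementTree.XMLPullParser is available) and unwrapped <row ... /> input like the sample in the question. The helper name iter_rows is made up for illustration:

```python
import xml.etree.ElementTree as ET


def iter_rows(lines):
    """Lazily yield the attribute dict of each <row/> from unwrapped input.

    `lines` is any iterable of strings, e.g. an open text file.
    iter_rows is a hypothetical helper, not part of ElementTree.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    # Feed a synthetic root first so the parser sees a single document.
    parser.feed("<rows>")
    root = None
    for line in lines:
        parser.feed(line)
        for event, elem in parser.read_events():
            if event == "start" and root is None:
                root = elem  # the first start event is the synthetic root
            elif event == "end" and elem.tag == "row":
                yield dict(elem.attrib)
                # Drop already-processed rows so memory use stays bounded.
                root.clear()
    parser.feed("</rows>")
    parser.close()
```

Usage would be along the lines of opening the file and looping with `for attrs in iter_rows(data):`, then reading `attrs.get('PostTypeId')` for each row.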

2 Comments

This worked! However, I needed to wrap the data in a root tag, i.e. <rows> <row ... /> <row ... /> </rows>, because otherwise a "junk after document element" error appears (like here: stackoverflow.com/questions/2574894/…).
Ah, I was unaware of the "junk after document element" problem. I edited my answer to include that. Usually when I use ElementTree the input is an XML file, so I don't have unwrapped data.
1

If each <row ... /> element sits on its own line and memory is a concern, this may work efficiently for you:

from xml.etree import ElementTree as ET

with open('yourfile', 'r') as f:
    # a file object yields its lines lazily, one at a time
    for line in f:
        # each line is a complete element, so fromstring needs no wrapper tag
        tree = ET.fromstring(line)
        # attrib holds all the attributes as a dict {key: value}
        # you may store this dict, append it to a list, or write it to a file or database
        print(tree.attrib)

Results from your sample:

{'PostTypeId': '2', 'CreationDate': '2009-03-18T20:06:33.720', 'Id': '659890', 'ParentId': '655986'}
{'PostTypeId': '2', 'CreationDate': '2009-03-18T20:07:44.843', 'Id': '659891', 'ParentId': '659089'}
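
To save the individual values in variables, as the question asks, each attribute string can be converted to a suitable type. This is a minimal sketch, assuming Python 3 and rows shaped like the sample (Id, PostTypeId, optional ParentId, CreationDate with fractional seconds). The function name parse_row and the output keys are invented for illustration. Lines that do not start with <row are skipped, which is an assumption about how blank lines or wrapper tags should be handled:

```python
from datetime import datetime
from xml.etree import ElementTree as ET


def parse_row(line):
    """Return typed fields for one <row ... /> line, or None for other lines.

    parse_row is a hypothetical helper; the field names are illustrative.
    """
    line = line.strip()
    if not line.startswith("<row"):
        return None  # blank line, XML declaration, or wrapper tag
    attrib = ET.fromstring(line).attrib
    return {
        "id": int(attrib["Id"]),
        "post_type_id": int(attrib["PostTypeId"]),
        # ParentId is treated as optional here (an assumption)
        "parent_id": int(attrib["ParentId"]) if "ParentId" in attrib else None,
        "creation_date": datetime.strptime(
            attrib["CreationDate"], "%Y-%m-%dT%H:%M:%S.%f"
        ),
    }
```

Inside the line-by-line loop, `post = parse_row(line)` followed by `if post is not None:` gives access to `post["post_type_id"]` and the other fields one row at a time, so only a single row is held in memory.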
