How to parse/extract specific values from a huge input file in Python?

Question

I have the following huge input file (from stackexchange dataset):

 <row Id="659890" PostTypeId="2" ParentId="655986" CreationDate="2009-03-18T20:06:33.720" />
 <row Id="659891" PostTypeId="2" ParentId="659089" CreationDate="2009-03-18T20:07:44.843" />

Usually, the way I process a file is by reading line by line:

with open("file.txt", "r") as f:
    for line in f:
        print(line)

However, for this case I would like to process it post by post. How can I do this?

Moreover, I want to be able to extract the value of PostTypeId and save it in a variable (I want to do the same for the other values as well).

So my question is: What is the most efficient way to do this assuming that the dataset can be really huge?

Take a look at lxml: codereview.stackexchange.com/questions/2449/…
– monkut, Commented Oct 15, 2014 at 20:39
To add a note to this: parsing the data dump files (especially for Stack Overflow) can easily exceed system memory limits if not done appropriately. That should be an important consideration in any responses to this question.
– Andy ♦, Commented Oct 15, 2014 at 20:46
I was trying to do this manually: reading line by line and appending each line to a local string variable until a line ended with "/>". Then I extracted the values by scanning the string and taking the content after the relevant attribute name (for example, after PostTypeId=" and before the next " character). After that I reset the string and repeated the process for the following lines. I know this is a rather crude and slower approach, but I guess it will work for big files because I am reading the file line by line (not 100% sure).
– Mike B, Commented Oct 15, 2014 at 21:24

Chrispresso · Accepted Answer · 2014-10-15 21:24:55Z

1

You can use xml.etree.ElementTree:

import xml.etree.ElementTree as ET

source = 'file.txt'  # path to an XML file with a single root element
tree = ET.parse(source)
root = tree.getroot()
# Look at each element that has the 'row' tag
for row in root.iter('row'):
    print(row.get('PostTypeId'))

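EDIT for "junk after document element": if the file has no root element, wrap its contents in one before parsing. Note that this reads the whole file into memory.

import xml.etree.ElementTree as ET

someFile = 'file.txt'  # path to the input file
with open(someFile, 'r') as data:
    xmlData = '<rows>' + data.read() + '</rows>'
rows = ET.fromstring(xmlData)
for row in rows:
    print(row.get('PostTypeId'))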
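Both snippets above hold the whole document in memory, which ties in with Andy's comment about the size of the dumps. As a rough sketch of a streaming variant (not part of the original answer), the standard library's XMLPullParser can be fed the file piece by piece, with a synthetic <rows> root supplied around the data. The function name iter_rows and the StringIO sample are made up for illustration; with a real file you would pass the open file object instead.

```python
import io
import itertools
import xml.etree.ElementTree as ET


def iter_rows(chunks):
    """Yield the attributes of each <row/> element as a dict.

    `chunks` is any iterable of strings (for example an open text file).
    Hypothetical helper: it assumes the input has no root element of its
    own, so a synthetic <rows> root is fed around it.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    root = None

    def drain():
        nonlocal root
        for event, elem in parser.read_events():
            if event == "start":
                if root is None:
                    root = elem  # the first element started is the synthetic root
            elif elem.tag == "row":
                yield dict(elem.attrib)
                root.clear()  # detach rows already handled so memory stays flat

    for chunk in itertools.chain(["<rows>"], chunks, ["</rows>"]):
        parser.feed(chunk)
        yield from drain()
    parser.close()
    yield from drain()  # pick up any events reported only at close


# Demo with the two sample rows; replace the StringIO with open('file.txt').
sample = io.StringIO(
    ' <row Id="659890" PostTypeId="2" ParentId="655986" CreationDate="2009-03-18T20:06:33.720" />\n'
    ' <row Id="659891" PostTypeId="2" ParentId="659089" CreationDate="2009-03-18T20:07:44.843" />\n'
)
for row in iter_rows(sample):
    print(row["Id"], row["PostTypeId"])
```

Because the parser is fed incrementally, this should also cope with a <row ... /> element that spans several lines, which the line-by-line approaches cannot.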

edited Oct 15, 2014 at 21:24

answered Oct 15, 2014 at 21:02

Chrispresso


2 Comments

Mike B Over a year ago

This worked! However, I had to wrap the data in root tags (<rows> <row ... /> <row ... /> </rows>), because otherwise a "junk after document element" error appears (like here: stackoverflow.com/questions/2574894/…).

Chrispresso Over a year ago

Ah, I was unaware of the "junk after document element" problem. I edited my answer to cover it. Usually when I use ElementTree the input is a proper XML file, so I don't have unwrapped data.

Anzel · Accepted Answer · 2014-10-16 00:25:44Z

1

If each <row ... /> element is on its own line and memory is a concern, this should work efficiently for you:

from xml.etree import ElementTree as ET

with open('yourfile', 'r') as f:
    # a file object is already a lazy iterator over its lines
    for line in f:
        if not line.strip():
            continue  # skip blank lines, which fromstring cannot parse
        # parse each line on its own, so no wrapping root tag is needed
        row = ET.fromstring(line)
        # attrib holds all attributes of the element as a dict {name: value}
        # you may store this dict, append it to a list, write it to a file or a database
        print(row.attrib)

Results from your sample (the key order may differ depending on the Python version):

{'PostTypeId': '2', 'CreationDate': '2009-03-18T20:06:33.720', 'Id': '659890', 'ParentId': '655986'}
{'PostTypeId': '2', 'CreationDate': '2009-03-18T20:07:44.843', 'Id': '659891', 'ParentId': '659089'}
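Since the question also asks about saving the values in variables, here is a small sketch of how the attribute dict could be turned into typed values. The helper name parse_row and the output keys are invented for illustration, and it assumes that ParentId may be missing on some rows and that CreationDate always has the fractional-seconds form shown in the sample.

```python
from datetime import datetime
from xml.etree import ElementTree as ET


def parse_row(line):
    """Parse one '<row ... />' line into a dict of typed values.

    Hypothetical helper. Assumes ParentId is optional and CreationDate
    looks like '2009-03-18T20:06:33.720'.
    """
    attrs = ET.fromstring(line).attrib
    parent_id = attrs.get("ParentId")  # None when the attribute is absent
    return {
        "id": int(attrs["Id"]),
        "post_type_id": int(attrs["PostTypeId"]),
        "parent_id": int(parent_id) if parent_id is not None else None,
        "creation_date": datetime.strptime(
            attrs["CreationDate"], "%Y-%m-%dT%H:%M:%S.%f"
        ),
    }


line = ' <row Id="659890" PostTypeId="2" ParentId="655986" CreationDate="2009-03-18T20:06:33.720" />'
post = parse_row(line)
post_type_id = post["post_type_id"]  # now an int rather than a string
print(post_type_id, post["creation_date"].year)
```

Calling parse_row inside the `for line in f` loop keeps only one post in memory at a time.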

edited Oct 16, 2014 at 0:25

answered Oct 16, 2014 at 0:03

Anzel
