
As the title mentions, my issue is that I don't quite understand how to extract the data I need for my table (the columns I need are Date, Time, Courtroom, File Number, Defendant Name, Attorney, Bond, Charge, etc.).

I think regex is what I need, but my class did not cover it, so I am confused about how to parse the file in order to extract the data and output it as an organized table.

I am supposed to turn my text file from this

https://pastebin.com/ZM8EPu0p

and export it into a more readable format like this (example output is below).

Here is what I have so far.

def readFile(court):
    csv_rows = []
    entry = {}  # current record being built
    # read the txt file and split it into chunks separated by blank lines
    with open(court, "r") as file:
        data_chunks = file.read().split("\n\n")

        for chunk in data_chunks:
            chunk = chunk.strip()  # strip() (with parentheses) removes surrounding whitespace
            if chunk[:4].isdigit():  # if the first 4 characters are digits
                entry = {}  # start a new record
            elif not chunk and entry:  # empty chunk and the entry dict is not empty
                csv_rows.append(entry)  # keep the finished record; write it out with csv.DictWriter after the loop
                entry = {}
            else:

                # parse here?

                print(chunk)

    return csv_rows

readFile("/Users/mia/Desktop/School/programming/court.txt")
  • Please move all your code into a source block; it's hard to read. Commented Nov 29, 2021 at 9:41

1 Answer


It is quite a lot of work to achieve that, but it is possible if you split it into a couple of sub-tasks. First, your input looks like a text file, so you could parse it line by line -- using https://www.w3schools.com/python/ref_file_readlines.asp

Then, I noticed that your data can be split into pages. You will need to prepare quite a few regular expressions, but you can start with one that identifies where each page starts -- you may want to read this, as your expression might get quite complicated: https://www.w3schools.com/python/python_regex.asp The goal of this step is to collect all lines from a page in some container (a list, dict, whatever you find suitable).
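A sketch of that grouping step. The `RUN DATE:` header pattern is a guess at what a page-start line in your dump might look like; swap in whatever actually begins each page in your file:

```python
import re

# Hypothetical page-start marker; adapt the pattern to your file.
PAGE_START = re.compile(r"^\s*RUN DATE:", re.IGNORECASE)

def split_pages(lines):
    """Group a flat list of lines into pages, one list per page."""
    pages, current = [], []
    for line in lines:
        if PAGE_START.match(line) and current:
            pages.append(current)   # close the previous page
            current = []
        current.append(line)
    if current:
        pages.append(current)       # don't lose the last page
    return pages
```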

And afterwards, write some code that parses the information page by page. For simplicity I suggest starting with something easy, like the columns for "no, file number and defendant".
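For instance, a row parser for just those three columns might look like the following; the sample layout ("no, file number, defendant" separated by whitespace) is an assumption about your file, so adjust the pattern to the real lines:

```python
import re

# Hypothetical row layout: "  1  21CR001234  SMITH, JOHN"
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(.+?)\s*$")

def parse_row(line):
    """Extract no, file number and defendant from one row, or None if it doesn't match."""
    m = ROW.match(line)
    if not m:
        return None
    no, file_number, defendant = m.groups()
    return {"no": int(no), "file_number": file_number, "defendant": defendant}
```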

And when you can get the data in a reliable manner, you can address the export part using pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html


3 Comments

Good answer. For parsing the data, I think you can use the fact that AKA:, TB:, CMPL:, CLS:, etc. all end with :, so it may be possible to use a regex to find the keywords and values; if not, try to go through it algorithmically somehow.
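To illustrate the colon idea from this comment (the sample line and the exact pattern are hypothetical, not taken from the real file):

```python
import re

# A KEY is a run of capitals followed by ":"; its value runs lazily
# until the next " KEY:" or the end of the line.
PAIR = re.compile(r"([A-Z]+):\s*(.*?)(?=\s[A-Z]+:|$)")

def pairs(line):
    """Return all KEY: value pairs on one line as a dict."""
    return dict(PAIR.findall(line))
```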
Thank you for this! To split my file by page, could I just use pages = file.split('\f') under my csv_rows = [] line, and then begin to parse line by line with regex under the else: section?
Yes, something in that direction anyway. In this kind of work it is much easier to try things out; debugging, or even temporary prints of some variables, is essential at the start. After a while it also helps to read documentation and tutorial pages.
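The form-feed idea from the comment above, as a sketch (it assumes the dump really does separate pages with "\f"; if not, fall back to a header-line regex):

```python
def pages_by_formfeed(text):
    """Split the whole file text into pages on the form-feed character."""
    # drop pieces that are empty or whitespace-only (e.g. after a trailing \f)
    return [p for p in text.split("\f") if p.strip()]
```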
