
As the title mentions, my issue is that I don't quite understand how to extract the data I need for my table (the columns I need are Date, Time, Courtroom, File Number, Defendant Name, Attorney, Bond, Charge, etc.).

I think regex is what I need, but my class did not cover it, so I am confused about how to parse the file in order to extract the data and output it as an organized table.

I am supposed to turn my text file from this

https://pastebin.com/ZM8EPu0p

and export it into a more readable format like this (example output is below).

Here is what I have so far.

def readFile(court):
    csv_rows = []
    entry = {}  # current record being built
    # read the txt file and split it into chunks separated by blank lines
    with open(court, "r") as file:
        data_chunks = file.read().split("\n\n")

        for chunk in data_chunks:
            chunk = chunk.strip()  # strip() (with parentheses) removes surrounding whitespace
            if chunk[:4].isdigit():  # if the first 4 characters are digits
                entry = {}  # start a new record
            elif not chunk and entry:  # empty chunk and the entry dict is not empty
                csv_rows.append(entry)  # keep the finished record; write it out with csv.DictWriter after the loop
                entry = {}
            else:

                # parse here?

                print(chunk)

    return csv_rows

readFile("/Users/mia/Desktop/School/programming/court.txt")
  • Please move all your code into a source block; it's hard to read. Commented Nov 29, 2021 at 9:41

1 Answer


It is quite a lot of work to achieve that, but it is possible if you split it into a couple of sub-tasks. First, your input looks like a text file, so you could parse it line by line -- using https://www.w3schools.com/python/ref_file_readlines.asp

Then, I noticed that your data can be split into pages. You will need to prepare quite a few regular expressions, but you can start with one that identifies where each page starts -- you may want to read this, as your expression might get quite complicated: https://www.w3schools.com/python/python_regex.asp The goal of this step is to collect all lines from a page in some container (a list, dict, whatever you find suitable).
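A sketch of that grouping step. The `RUN DATE:` header pattern is a guess at what a page-start line in your dump might look like; swap in whatever actually begins each page in your file:

```python
import re

# Hypothetical page-start marker; adapt the pattern to your file.
PAGE_START = re.compile(r"^\s*RUN DATE:", re.IGNORECASE)

def split_pages(lines):
    """Group a flat list of lines into pages, one list per page."""
    pages, current = [], []
    for line in lines:
        if PAGE_START.match(line) and current:
            pages.append(current)   # close the previous page
            current = []
        current.append(line)
    if current:
        pages.append(current)       # don't lose the last page
    return pages
```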

And afterwards, write some code that parses the information page by page. For simplicity I suggest starting with something easy, like the columns for "no, file number and defendant".
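For instance, a row parser for just those three columns might look like the following; the sample layout ("no, file number, defendant" separated by whitespace) is an assumption about your file, so adjust the pattern to the real lines:

```python
import re

# Hypothetical row layout: "  1  21CR001234  SMITH, JOHN"
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(.+?)\s*$")

def parse_row(line):
    """Extract no, file number and defendant from one row, or None if it doesn't match."""
    m = ROW.match(line)
    if not m:
        return None
    no, file_number, defendant = m.groups()
    return {"no": int(no), "file_number": file_number, "defendant": defendant}
```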

And when you can get the data in a reliable manner, you can address the export part using pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html


3 Comments

Good answer. For parsing the data, I think you can use the fact that AKA:, TB:, CMPL:, CLS:, etc. all end with :, so it may be possible to use a regex to find the keywords and values; if not, try to go through it algorithmically somehow.
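To illustrate the colon idea from this comment (the sample line and the exact pattern are hypothetical, not taken from the real file):

```python
import re

# A KEY is a run of capitals followed by ":"; its value runs lazily
# until the next " KEY:" or the end of the line.
PAIR = re.compile(r"([A-Z]+):\s*(.*?)(?=\s[A-Z]+:|$)")

def pairs(line):
    """Return all KEY: value pairs on one line as a dict."""
    return dict(PAIR.findall(line))
```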
Thank you for this! To split my file by page, could I just use pages = file.split('\f') under my csv_rows = [] line, and then begin to parse line by line with regex under the else: section?
Yes, something in that direction anyway. In this kind of work it is much easier to try things out; debugging, or even temporary prints of some variables, is essential at the start. After a while it also helps to read documentation and tutorial pages.
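The form-feed idea from the comment above, as a sketch (it assumes the dump really does separate pages with "\f"; if not, fall back to a header-line regex):

```python
def pages_by_formfeed(text):
    """Split the whole file text into pages on the form-feed character."""
    # drop pieces that are empty or whitespace-only (e.g. after a trailing \f)
    return [p for p in text.split("\f") if p.strip()]
```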
