Python REST API Data Extraction with requests library

Ask Question

Asked 4 years, 9 months ago

Modified 3 years, 8 months ago

Viewed 288 times

Is there a better way to write this piece of code to extract data from a REST API with the requests library? It currently takes about 10 minutes to finish extracting the data from the REST API. I think the code below may not be the most efficient method.

import time
import requests
import datetime
import pandas as pd

def loan_rest_api():
    '''
    This function calls the rest api and stored the data in a pandas dataframe.
    '''

    
    search_start = time.time()
    print("Extracting loan data from Rest API...")

    # Declaring variables for the first page number and the list that contains the results for each page in the REST API.
    num = 1
    pages = []

    # Declaring variable to look up the correct table in the REST API.
    table = 'loan'

    # The 'loan' table has about 50 columns, but these are the ones I need.
    # Declaring columns variable to choose the columns I'm interested in.
    columns = '''    
                loan_id,
                loan_name,
                loan_parent_id,
                loan_status,
                loan_type,
                loan_classification,
                loan_create_date,
                loan_last_update,
                loan_amount,
                loan_amount_last_update,
                loan_balance,
                loan_balance_last_update,
                loan_interest,
                loan_interest_last_update,
                city,
                office_location,
                project_name          
            '''

    # Infinitely loop needed to iterate through all of the pages of data that the loan
    # table contains. Since I don't know how many pages there are, I loop
    # through the entire table until it reaches the end. Once
    # it reaches the end, the While True condition no longer holds so the loop stops.
    while True:

        # Since the loan table in the rest api contains several pages, I had to implement a 
        # pagination functionality to prevent the function from either timing out or crashing.
        pagination = {'pageSize': '5000',
                      'pageNumber': str(num),
                      'attributes': columns.replace(' ', '').replace('\n', '')}

        # Send a GET request to the REST API on the corresponding loan table.
        # the params parameter contains the pagination requirements. Start 
        # with page 1, then page 2, etc. For each resulting data pull, append
        # the page to the pages list.
        # num increase by 1 on the next loop until it reaches the end.
        response = requests.get(REST_API_LINK + table, params=pagination)
        if response.status_code != 200:
            break
        results = response.json()
        pages.append(pd.DataFrame(results))
        num += 1

    # Create a pandas dataframe. This contains about 200,000 rows and it's
    # 40MB in size when exported as a csv file.
    final_df = pd.concat(pages, ignore_index=True)

    # Right now this takes about 800 seconds.
    search_end = time.time()
    print(f" ...search completed in {search_end - search_start: .2f} seconds.")

    return print('Search Done.')


```

edited Mar 22, 2022 at 14:56

Reinderien

71.2k5 gold badges76 silver badges257 bronze badges

asked Feb 8, 2021 at 21:20

Benb27

212 bronze badges

\$\begingroup\$ Yes, I'm on it. Sorry about that. \$\endgroup\$

Benb27
– Benb27

2021-02-08 21:48:09 +00:00
Commented Feb 8, 2021 at 21:48
\$\begingroup\$ Hi @Peilonrayz , I included descriptions on the code. I can elaborate on specific areas if needed. \$\endgroup\$

Benb27
– Benb27

2021-02-08 22:03:25 +00:00
Commented Feb 8, 2021 at 22:03
1

\$\begingroup\$ Seems the most time consumed action here is downloading the json data from server. To speed up, maybe you can download multiple parts concurrency, and or increase pageSize (if the server not configured to limit request frequency). \$\endgroup\$

tsh
– tsh

2021-02-09 01:44:49 +00:00
Commented Feb 9, 2021 at 1:44
\$\begingroup\$ Hi @tsh , I increased the pageSize from 5000 to 8000 and the extraction time improved by 6.5% going from 800 seconds to 750 seconds. At least it's something! I'll keep tinkering with the pagination parameter. All other tables in the REST API take about 1, 2, or 3 seconds. It's this Loan table with several pages of JSON data that take a while to extract. \$\endgroup\$

Benb27
– Benb27

2021-02-10 19:33:54 +00:00
Commented Feb 10, 2021 at 19:33
\$\begingroup\$ pageSize is now 15000 and time decreased to 600 seconds! Thank you @tsh! \$\endgroup\$

Benb27
– Benb27

2021-02-10 19:44:17 +00:00
Commented Feb 10, 2021 at 19:44

| Show 1 more comment

0 You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.

Stack Exchange Network

Python REST API Data Extraction with requests library

0

You must log in to answer this question.

Hot Network Questions

Python REST API Data Extraction with requests library

0

You must log in to answer this question.

Related

Hot Network Questions