Web Scraping DOL with Selenium

Question

This is my code to extract data from the DOL website for a project I'm doing for my portfolio. The code does extract all of the data that I need, however it runs really slowly. I think it took like 15 minutes for it to finish extracting everything. I would really appreciate if someone could just point me in the right direction.

import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.action_chains import ActionChains

url = 'https://oui.doleta.gov/unemploy/claims.asp'
driver = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\chromedriver.exe")

driver.implicitly_wait(10)
driver.get(url)
driver.find_element_by_css_selector('input[name="level"][value="state"]').click()
Select(driver.find_element_by_name('strtdate')).select_by_value('2020')
Select(driver.find_element_by_name('enddate')).select_by_value('2021')
driver.find_element_by_css_selector('input[name="filetype"][value="html"]').click()
select = Select(driver.find_element_by_id('states'))

for opt in select.options:
    opt.click()
input('Press ENTER to submit the form')
driver.find_element_by_css_selector('input[name="submit"][value="Submit"]').click()


rows = driver.find_elements_by_xpath('//*[@id="content"]/table/tbody')

unemployment = []  
# Iterate over the rows
        
# Loop through tables from url with help of BeautifulSoup
for table in rows:
    headers = []   
    for head in table.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr[2]/th'):
        headers.append(head.text)
    values = []
    for row in table.find_elements_by_xpath('//*[@id="content"]/table/tbody/tr')[2:]:
        values = []
        for col in row.find_elements_by_xpath("./*[name()='th' or name()='td']"):
            values.append(col.text)
        if values:
            unemp_dict = {headers[i]: values[i] for i in 
            range(len(headers))}
            unemployment.append(unemp_dict)
    

input('Press ENTER to close the automated browser')
driver.quit()
```

Setris · Accepted Answer · 2021-11-21 10:10:44Z

To keep the review short, if you can get the data from an API instead of scraping, that's always going to be the faster option.

Luckily, in this case there is such an API. Open the site with Chrome developer tools and request data for an arbitrary state. You will see that the website calls POST https://oui.doleta.gov/unemploy/wkclaims/report.asp to retrieve the data you're interested in.

The following code snippet shows the API call and parsing of the response.

#!/usr/bin/env python3

import requests
import xml.etree.ElementTree as ET

"""
Example XML response returned from API:

<r539cyState>
    <week>
        <stateName>California</stateName>
        <weekEnded>10/17/2020</weekEnded>
        <InitialClaims>159,876</InitialClaims>
        <ReflectingWeekEnded>10/10/2020</ReflectingWeekEnded>
        <ContinuedClaims>1,836,240</ContinuedClaims>
        <CoveredEmployment>17,473,068</CoveredEmployment>
        <InsuredUnemploymentRate>10.51</InsuredUnemploymentRate>
    </week>
    ...
</r539cyState>
"""


def main():
    # map of XML tag names to table column names
    tag_to_column_names = {
        "stateName": "State",
        "weekEnded": "Filed week ended",
        "InitialClaims": "Initial Claims",
        "ReflectingWeekEnded": "Reflecting Week Ended",
        "ContinuedClaims": "Continued Claims",
        "CoveredEmployment": "Covered Employment",
        "InsuredUnemploymentRate": "Insured Unemployment Rate",
    }

    data = {
        "level": "state",
        "final_yr": 2022,
        "strtdate": 2020,
        "enddate": 2021,
        "filetype": "xml",
        "states[]": "CA",
        "submit": "Submit",
    }

    with requests.Session() as s:
        r = s.post(
            "https://oui.doleta.gov/unemploy/wkclaims/report.asp", data=data
        )
        root = ET.fromstring(r.text)

        # iterating over <week> elements
        for week in root:
            for tag_name, column_name in tag_to_column_names.items():
                print(
                    f"{column_name} ({tag_name}): {week.find(tag_name).text}"
                )
            print("\n")


if __name__ == "__main__":
    main()

Output:

State (stateName): California
Filed week ended (weekEnded): 01/04/2020
Initial Claims (InitialClaims): 36,720
Reflecting Week Ended (ReflectingWeekEnded): 12/28/2019
Continued Claims (ContinuedClaims): 359,606
Covered Employment (CoveredEmployment): 17,203,829
Insured Unemployment Rate (InsuredUnemploymentRate): 2.09

...

You can extend this example by iterating over all the states and making a request for each one. And if I were you I would just hard code the states in the script since there are only 53 of them (53 because the site's data includes District of Columbia (DC), Puerto Rico (PR), and the Virgin Islands (VI)).

Stack Exchange Network

Web Scraping DOL with Selenium

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Web Scraping DOL with Selenium

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions