Scraping csv file from url with React script

Question

I want to scrape the sample_info.csv file from https://depmap.org/portal/download/. Because the page is rendered by a React script, I can't simply locate the file through an appropriate tag with BeautifulSoup. I tried several approaches, and the one below gave me the best results: it returns the script in which all downloadable files are listed together with other data. My idea was then to strip the tags and parse the remaining text as JSON. That fails, so I suspect the data is malformed in some way.

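import json

import requests
from bs4 import BeautifulSoup

url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
all_scripts = soup.find_all('script')
# The script tag at index 32 is the one that lists the files
script = str(all_scripts[32])
# Keep everything from the first "[{" to the last "}]"
last_char_index = script.rfind("}]")
first_char_index = script.find("[{")
script_cleaned = script[first_char_index:last_char_index+2]
script_json = json.loads(script_cleaned)  # raises JSONDecodeError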

This code raises JSONDecodeError: Extra data: line 1 column 7250 (char 7249). I know my solution might not be elegant, but it took me closest to the goal, i.e. downloading the sample_info.csv file from the website. I'm not sure how to proceed. Are there other options? I tried Selenium, but that is not feasible for the end users of my script because it requires declaring a driver path.
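For context on the error message: json.loads reports "Extra data" when it has parsed one complete JSON value and then finds more non-whitespace text after it. A slice from the first "[{" to the last "}]" can easily span several arrays with JavaScript in between. The sketch below reproduces this on a synthetic string (the contents of text are made up for illustration, not taken from the DepMap page) and shows json.JSONDecoder().raw_decode, which parses only the first value and returns the index where it stopped. Whether the first value on the real page contains the file list is an assumption that would need checking.

```python
import json

# Synthetic stand-in for script_cleaned: two arrays with JavaScript between them.
text = '[{"fileName": "a.csv"}]; var other = [{"fileName": "b.csv"}]'

message = None
try:
    json.loads(text)
except json.JSONDecodeError as err:
    # The first array parsed fine; the parser then hit the trailing text.
    message = err.msg
print(message)

# raw_decode parses a single JSON value from the start of the string and
# returns it together with the index at which parsing stopped.
first_value, end = json.JSONDecoder().raw_decode(text)
print(first_value)
print(text[end:])  # the part json.loads complained about
```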

The issue is that the string script_cleaned comes from a JS script (it's a JavaScript object), not JSON. You'll need to convert the JS text to JSON somehow.
– Trevor Manz, Commented Mar 15, 2021 at 16:14

Trevor Manz · Accepted Answer · 2021-03-15 17:03:49Z


It is probably easier in this context to use regular expressions, since the string is invalid JSON.

This RegEx tool (https://pythex.org/) can be useful for testing expressions.

import re
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', script_cleaned)
#[
#  ('https://ndownloader.figshare.com/files/26261524', 'CCLE_gene_cn.csv'),
#  ('https://ndownloader.figshare.com/files/26261527', 'CCLE_mutations.csv'),
#  ('https://ndownloader.figshare.com/files/26261293', 'Achilles_gene_effect.csv'),
#  ('https://ndownloader.figshare.com/files/26261569', 'sample_info.csv'),
#  ('https://ndownloader.figshare.com/files/26261476', 'CCLE_expression.csv'),
#  ('https://ndownloader.figshare.com/files/17741420', 'primary_replicate_collapsed_logfold_change_v2.csv'),
#  ('https://gygi.med.harvard.edu/publications/ccle',  'protein_quant_current_normalized.csv'),
#  ('https://ndownloader.figshare.com/files/13515395', 'D2_combined_gene_dep_scores.csv')
# ]

Edit: This also works when html_content is passed directly (no need for BeautifulSoup).

import re
import requests

url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', html_content)
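To go from the list of matches to the file itself, one possible continuation is sketched below. It is not part of the original answer: the helper names find_download_url, save_url and download_from_portal are hypothetical, and the sketch swaps requests for the standard library's urllib.request so that it runs without extra packages. It assumes the page still embeds "downloadUrl" and "fileName" pairs in the form the regular expression expects.

```python
import re
import shutil
import urllib.request

PORTAL_URL = "https://depmap.org/portal/download/"
# Same expression as in the answer, compiled once for reuse.
PATTERN = re.compile(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"')


def find_download_url(html, file_name):
    """Return the download URL listed for file_name, or None if it is absent."""
    for url, name in PATTERN.findall(html):
        if name == file_name:
            return url
    return None


def save_url(url, destination):
    """Stream the response body at url into the local file destination."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)


def download_from_portal(file_name, destination):
    """Hypothetical end-to-end helper: fetch the portal page, then the file."""
    with urllib.request.urlopen(PORTAL_URL) as response:
        html = response.read().decode("utf-8", errors="replace")
    url = find_download_url(html, file_name)
    if url is None:
        raise LookupError(f"{file_name!r} is not listed on the portal page")
    save_url(url, destination)
```

Calling download_from_portal("sample_info.csv", "sample_info.csv") would then write the file to the working directory, provided the portal page and the download link still respond as they did when the answer was written.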

edited Mar 15, 2021 at 17:03

answered Mar 15, 2021 at 16:55

Trevor Manz


1 Comment

szuszfol Over a year ago

Thank you! This works perfect for my solution :)
