I want to scrape the sample_info.csv file from https://depmap.org/portal/download/. Because the page is rendered by a React script, I can't simply locate the file through a tag with BeautifulSoup. I tried several approaches, and the one below gave the best results: it returns the script in which all downloadable files are listed together with other data. My idea was then to strip the tags and load the data as JSON. However, I think there must be some mistake in the data, because it cannot be parsed as JSON.

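import json
import requests
from bs4 import BeautifulSoup

url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
all_scripts = soup.find_all('script')
# The script at index 32 is the one that lists the downloadable files
script = str(all_scripts[32])
# Keep everything from the first "[{" to the last "}]"
last_char_index = script.rfind("}]")
first_char_index = script.find("[{")
script_cleaned = script[first_char_index:last_char_index+2]
script_json = json.loads(script_cleaned)  # raises JSONDecodeError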

This code raises JSONDecodeError: Extra data: line 1 column 7250 (char 7249). I know my solution might not be elegant, but it got me closest to the goal, i.e. downloading the sample_info.csv file from the website. I'm not sure how to proceed. Are there other options? I tried Selenium, but that is not feasible for the end users of my script because they would have to declare the driver path.

  • The issue is that the string script_cleaned comes from a JS script (it's a JavaScript object), not JSON. You'll need to convert the JS text to JSON somehow. Commented Mar 15, 2021 at 16:14
  • You have to load it using the json module. Commented Mar 15, 2021 at 17:14
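
For context, an "Extra data" error means that json.loads parsed one complete JSON value and then found more text after it. The sketch below reproduces that on a made-up string (not the real page content) and shows json.JSONDecoder().raw_decode, which returns the first value together with the index where it ended. Whether this helps on the actual page is an assumption: it only works if the first array in the script is itself valid JSON.

```python
import json

# Hypothetical text: one valid JSON array followed by unrelated script content.
text = '[{"a": 1}] ; var other = [{"b": 2}]'

try:
    json.loads(text)
    error_message = None
except json.JSONDecodeError as exc:
    # json.loads refuses any non-whitespace text after the first value
    error_message = exc.msg

# raw_decode parses only the first JSON value and reports where it stopped
first_value, end = json.JSONDecoder().raw_decode(text)

print(error_message)
print(first_value)
print(text[end:])
```

The slice text[end:] is whatever json.loads complained about, which can help when checking what follows the first array in script_cleaned.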

1 Answer

It is probably easier in this context to use regular expressions, since the string is invalid JSON.

This RegEx tool (https://pythex.org/) can be useful for testing expressions.

import re
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', script_cleaned)
#[
#  ('https://ndownloader.figshare.com/files/26261524', 'CCLE_gene_cn.csv'),
#  ('https://ndownloader.figshare.com/files/26261527', 'CCLE_mutations.csv'),
#  ('https://ndownloader.figshare.com/files/26261293', 'Achilles_gene_effect.csv'),
#  ('https://ndownloader.figshare.com/files/26261569', 'sample_info.csv'),
#  ('https://ndownloader.figshare.com/files/26261476', 'CCLE_expression.csv'),
#  ('https://ndownloader.figshare.com/files/17741420', 'primary_replicate_collapsed_logfold_change_v2.csv'),
#  ('https://gygi.med.harvard.edu/publications/ccle',  'protein_quant_current_normalized.csv'),
#  ('https://ndownloader.figshare.com/files/13515395', 'D2_combined_gene_dep_scores.csv')
# ]
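
To see how the pattern behaves in isolation, here is a minimal sketch on a made-up snippet. The sample string and the example.org URLs are hypothetical and only imitate the key names used in the expression. Note that "." does not match newlines by default, so each downloadUrl/fileName pair has to sit on the same line unless re.DOTALL is passed.

```python
import re

# Hypothetical snippet imitating the structure the expression expects.
sample = (
    '[{"downloadUrl": "https://example.org/files/1", "size": "1 MB", "fileName": "a.csv"}, '
    '{"downloadUrl": "https://example.org/files/2", "size": "2 MB", "fileName": "b.csv"}]'
)

# Two non-greedy capture groups: the URL, then the nearest following file name
PATTERN = r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"'

pairs = re.findall(PATTERN, sample)
print(pairs)
```

Because the pattern has two groups, re.findall returns a list of (url, file_name) tuples.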

Edit: This also works when html_content is passed directly (no need for BeautifulSoup).

import re
import requests

url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', html_content)
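
From the list of pairs, one possible way to get to sample_info.csv is to look up its URL and save the file. This is only a sketch: find_download_url and download are hypothetical helper names, the pairs are copied from the output shown above, and the download uses urllib.request.urlretrieve from the standard library on the assumption that the server accepts a plain request.

```python
import urllib.request


def find_download_url(pairs, file_name):
    """Return the URL paired with file_name, or None if it is not listed."""
    for url, name in pairs:
        if name == file_name:
            return url
    return None


def download(url, destination):
    """Save the resource at url to the local path destination."""
    urllib.request.urlretrieve(url, destination)


# Two of the (url, file_name) pairs from the re.findall output above
pairs = [
    ('https://ndownloader.figshare.com/files/26261527', 'CCLE_mutations.csv'),
    ('https://ndownloader.figshare.com/files/26261569', 'sample_info.csv'),
]

sample_info_url = find_download_url(pairs, 'sample_info.csv')
print(sample_info_url)
# download(sample_info_url, 'sample_info.csv')  # uncomment to fetch the file
```

Looking the URL up by file name rather than by position keeps the script working if the order of the files on the page changes.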

1 Comment

Thank you! This works perfectly for my solution :)
