3

I am trying to scrape a list from the following URL: https://www.oncomap.de/centers?selectedOrgan=Darm&selectedCounty=Deutschland

Using Chrome's Developer Tools, I find that my content of interest is inside body > app-root > app-top > div ... . I tried finding this content using Python's BeautifulSoup4 package. Unfortunately, it is not possible to dive into the structure beyond the app-root tag. I am using the following code:

import requests
from bs4 import BeautifulSoup
import pprint

headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }

url = 'https://www.oncomap.de/centers?selectedOrgan=Darm&selectedCounty=Deutschland'
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, "html.parser")

mat_row = soup.select('body > app-root')

pp = pprint.PrettyPrinter()
for child in mat_row[0].descendants:
    pp.pprint(child)

There is no output from this code: no descendant is printed (I also tried children). I think I am dealing with a ReactJS div here. Would anyone have any hints on how to process such content? Specifically, I am keen to scrape the main list on the page into a Python-readable table. Thanks for your help!

1
  • You might have to use Selenium. Commented Jun 29, 2020 at 7:34

2 Answers

1

The data is loaded dynamically via JavaScript, so it is not part of the HTML that requests receives. However, you can use the requests module to fetch the data directly from the JSON endpoints:

import json
import requests

# JSON endpoints that serve the clinic and center data
clinics_url = 'https://back.oncomap.de/api/direct/fulldb_clinics'
centers_url = 'https://back.oncomap.de/api/direct/fulldb_centers'

data1 = requests.get(clinics_url).json()
data2 = requests.get(centers_url).json()

# index the clinics by number so that each center can look up its clinic
clinics = {d['clinic_nr']: d for d in data1}

# uncomment this to print all data:
# print(json.dumps(data1, indent=4))
# print(json.dumps(data2, indent=4))

# one tab-separated row per center; '-' is printed when no matching clinic exists
for c in data2:
    print(c['reg_nr'], c['inst1'], clinics.get(c['clinic_nr'], {}).get('inst1', '-'), c['url'], sep='\t')

Prints:

AB-Z001 G   Brustzentrum Stuttgart am Marienhospital    Marienhospital Stuttgart    https://www.marienhospital-stuttgart.de/interdisziplinaere-zentren/brustzentrum/
FAB-Z007-1 G    Universitäts-Brustzentrum Tübingen  Universitätsklinikum Tübingen, CCC Tübingen-Stuttgart   www.uni-frauenklinik-tuebingen.de/brustzentrum.html
FAB-Z010 G  Interdisziplinäres Brustkrebszentrum der Charité (IBZ) im Charité Comprehensive Cancer Center   Charité - Campus Mitte  https://cccc.charite.de/leistungen/organbereiche/brustkrebs/
FAB-Z012-1 G    Kooperatives Brustzentrum Klinikum Region Hannover  KRH Klinikum Siloah www.krh.eu/klinikum/SOH/zentren/brustzentrum
FAB-Z016 G  Brustzentrum Robert-Bosch-Krankenhaus   Robert-Bosch-Krankenhaus; Klinik Schillerhöhe   http://www.rbk.de/disziplinen/interdisziplinaere-zentren/brustzentrum.html
FAB-Z017 G  Brustzentrum Halle des Universitätsklinikums Halle (Saale)  Universitäts-Klinikum Halle-Saale   www.unifrauenklinik-halle.de
FAB-Z020 G  Brustzentrum im Sana Klinikum Lichtenberg   Sana Klinikum Lichtenberg   http://www.sana-kl.de/unser-leistungsspektrum/kliniken-institute/brustzentrum-des-sana-klinikum-lichtenberg.html
FAB-Z021 G  Interdisziplinäres Brustzentrum der ALB FILS KLINIKEN   Klinik am Eichert Göppingen www.alb-fils-kliniken.de
FAB-Z022    Kooperatives Brustzentrum Landshut  Klinikum Landshut   www.klinikum-landshut.de
FAB-Z023 G  Brustzentrum Saar Mitte CaritasKlinikum Saarbrücken St. Theresia    www.caritasklinik.de
FAB-Z024 G  Brustzentrum am Universitätsklinikum Hamburg-Eppendorf  Universitätsklinikum Hamburg-Eppendorf  www.uke.de/kliniken-institute/zentren/brustzentrum/index.html
FAB-Z025-1  Südthüringer Brustzentrum Suhl / Meiningen  SRH Zentralklinikum Suhl    www.srh.de
FAB-Z026 G  Brustzentrum Klinikum Oldenburg Klinikum Oldenburg  www.klinikum-oldenburg.de

...and so on.
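
Since the question asked for a Python-readable table, the join done in the loop above could also be collected into a list of rows and serialized as CSV. This is only a sketch under stated assumptions: the helper names `build_rows` and `rows_to_csv` are made up for illustration, the sample records are invented placeholders, and the field names (`reg_nr`, `inst1`, `clinic_nr`, `url`) are taken from the code in this answer rather than from any API documentation.

```python
import csv
import io


def build_rows(centers, clinics):
    """Join each center with its clinic, matched on 'clinic_nr'."""
    clinics_by_nr = {c['clinic_nr']: c for c in clinics}
    rows = []
    for center in centers:
        # Fall back to an empty dict when the center's clinic is unknown.
        clinic = clinics_by_nr.get(center.get('clinic_nr'), {})
        rows.append({
            'reg_nr': center.get('reg_nr', ''),
            'center': center.get('inst1', ''),
            'clinic': clinic.get('inst1', '-'),
            'url': center.get('url', ''),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text, header line first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=['reg_nr', 'center', 'clinic', 'url'],
        lineterminator='\n',
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented placeholder records standing in for the JSON from the two endpoints.
sample_clinics = [{'clinic_nr': 1, 'inst1': 'Example Clinic'}]
sample_centers = [
    {'reg_nr': 'X-001', 'inst1': 'Example Center', 'clinic_nr': 1, 'url': 'www.example.org'},
    {'reg_nr': 'X-002', 'inst1': 'Orphan Center', 'clinic_nr': 99, 'url': ''},
]

rows = build_rows(sample_centers, sample_clinics)
print(rows_to_csv(rows))
```

With live data, `build_rows(data2, data1)` would take the place of the sample lists, since `data2` holds the centers and `data1` the clinics.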

1 Comment

Thanks, Andrej. This answer completely resolved my problem and is quite elegant. Note to self: try finding the API call in such cases, rather than parsing the HTML output.
1

Since the page is loaded dynamically, you won't get the rendered HTML by fetching it with the requests package alone.

What you can do instead is scrape with a headless browser and make it wait until a specific element has appeared on the page.

Here is a tutorial on web scraping with Selenium (a package for driving headless browsers): https://www.scrapingbee.com/blog/selenium-python/

That tutorial also has a section titled "waiting for an element to be present", which appears to cover what you need.

Also, here is a Stack Overflow question related to what you want to do: Wait until page is loaded with selenium webdriver
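
The wait-for-element approach described in this answer might look roughly like the sketch below. It assumes Selenium 4 with a local Chrome installation; the function name `fetch_rendered_html` is hypothetical, and the CSS selector is only a guess based on the element path reported in the question, so a more specific selector for the list rows would probably be needed in practice.

```python
def fetch_rendered_html(url, css_selector, timeout=20):
    """Load url in headless Chrome and return the HTML once css_selector is present."""
    # Imported here so this module can be loaded even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the element exists in the DOM; raises TimeoutException otherwise.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


URL = 'https://www.oncomap.de/centers?selectedOrgan=Darm&selectedCounty=Deutschland'
# Assumed selector, derived from the path "body > app-root > app-top > div" in the question.
SELECTOR = 'app-root app-top div'
```

Calling `fetch_rendered_html(URL, SELECTOR)` would then return the rendered page source, which can be passed to BeautifulSoup in the same way as in the question.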

1 Comment

Hi Drago96, thanks for the response. I should look into Selenium more closely. Seems useful for a range of tasks...
