
Let's say I would like to scrape some metadata from a website:

https://www.diepresse.com/4913597/autocluster-buhlt-um-osterreich-teststrecke-fur-google-autos

More precisely, I want the value /home/wirtschaft/international of the key fullChannel from this <script> tag:

<script>    
let pageBreakpoint = 'desktop';
let _screen = window.innerWidth;
if (_screen < 640) {
    pageBreakpoint = 'mobile';
} else if (_screen < 1024) {
    pageBreakpoint = 'tablet';
}

var dataLayer = window.dataLayer || [];
    dataLayer.push({
        'siteId': 'dpo',
        'contentId': '4913597',
        'pageType': 'article',
        'contentTitle': 'Autocluster buhlt um Österreich-Teststrecke für Google-Autos',
        'contentAuthor': '',
        'contentElements': '',
        'contentType': 'default',
        'pageTags': '',
        'wordCount': '264',
        'wordCountRounded': '400',
        'contentSource': '',
        'contentPublishingDate': '',
        'contentPublishingDateFormat': '28/01/2016',
        'contentPublishingTime': '08:52',
        'contentPublishingTimestamp': '28/01/2016 08:52:00',
        'contentRepublishingTimestamp': '28/01/2016 08:52:00',
        'contentTemplate': 'default',
        'metaCategory': '',
        'channel': 'international',
        'fullChannel': '/home/wirtschaft/international',
        'canonicalUrl': '',
        'fullUrl': window.location.href,
        'oewaPath': 'RedCont/Wirtschaft/Wirtschaftspolitik',
        'oewaPage': 'homepage',
        'isPremium':'no',
        'isPremiumArticle': 'free',
        'pageBreakpoint': pageBreakpoint,
        'userId': ''
    });
</script>

Right now I am using Selenium and XPath, and I can't figure out how to apply a regex to this:

# this doesn't work
driver.find_element_by_xpath("//script[text()]")

Any suggestions?

2 Answers


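Use Selenium's JavaScript executor (execute_script) to return the dataLayer variable. In Python it comes back as a list of dicts, one per dataLayer.push call.

Then read the value of the key fullChannel from the entry that contains it.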

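from selenium import webdriver

driver = webdriver.Chrome()  # assumption: Chrome; any Selenium WebDriver works the same way
driver.get("https://www.diepresse.com/4913597/autocluster-buhlt-um-osterreich-teststrecke-fur-google-autos")
datalayer = driver.execute_script("return dataLayer")
print(datalayer)
print(datalayer[0]['fullChannel'])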

Output:

[{'oewaPage': 'homepage', 'contentTitle': 'Autocluster buhlt um Österreich-Teststrecke für Google-Autos', 'userId': '', 'wordCount': '264', 'contentSource': '', 'contentPublishingDate': '', 'contentElements': '', 'contentAuthor': '', 'fullUrl': 'https://www.diepresse.com/4913597/autocluster-buhlt-um-osterreich-teststrecke-fur-google-autos', 'wordCountRounded': '400', 'contentTemplate': 'default', 'canonicalUrl': '', 'contentPublishingTime': '08:52', 'metaCategory': '', 'siteId': 'dpo', 'contentPublishingDateFormat': '28/01/2016', 'isPremium': 'no', 'oewaPath': 'RedCont/Wirtschaft/Wirtschaftspolitik', 'contentRepublishingTimestamp': '28/01/2016 08:52:00', 'contentPublishingTimestamp': '28/01/2016 08:52:00', 'pageTags': '', 'pageBreakpoint': 'desktop', 'contentType': 'default', 'fullChannel': '/home/wirtschaft/international', 'isPremiumArticle': 'free', 'contentId': '4913597', 'channel': 'international', 'pageType': 'article'}, {'faktorVendorData4': 'notset', 'event': 'faktorData', 'faktorData4': 'notset', 'gtm.uniqueEventId': 9, 'faktorData1': 'notset', 'faktorData2': 'notset', 'faktorData5': 'notset', 'faktorData3': 'notset'}, {'gtm.uniqueEventId': 3, 'gtm.start': 1569877670044, 'event': 'gtm.js'}, {'aboStatus': '', 'userId': '', 'userType': 'default', 'userStatus': 'logout'}, {'gtm.uniqueEventId': 6, 'event': 'gtm.dom'}, {'gtm.uniqueEventId': 14, 'gtm.start': 1569877672926, 'event': 'gtm.js'}, {'faktorGdprApplies': 1}, {'gtm.uniqueEventId': 15, 'event': 'gtm.load'}]

Value of the key fullChannel:

/home/wirtschaft/international
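
Indexing with datalayer[0] works here because the page's own push happens to be the first entry, but other scripts (the gtm.* events in the output above) also push into dataLayer, so the position may not be guaranteed. A minimal sketch that searches for the key instead; the helper name find_in_data_layer and the shortened sample list are made up for illustration:

```python
def find_in_data_layer(data_layer, key):
    """Return the value of `key` from the first dataLayer entry that contains it."""
    for entry in data_layer:
        # Skip anything that is not a dict, in case a script pushed something else.
        if isinstance(entry, dict) and key in entry:
            return entry[key]
    return None


# Shortened stand-in for the list returned by driver.execute_script("return dataLayer").
sample = [
    {'event': 'gtm.js', 'gtm.uniqueEventId': 3},
    {'channel': 'international', 'fullChannel': '/home/wirtschaft/international'},
]
print(find_in_data_layer(sample, 'fullChannel'))
```

With a live driver you would pass the result of driver.execute_script("return dataLayer") instead of sample.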

Your XPath for finding the script seems to be wrong. Try this:

script = driver.find_element_by_xpath("//script[contains(text(), 'let pageBreakpoint')]")

Then you can use string methods on the script's text to extract the value of fullChannel:

# find_element returns a WebElement, not a string, so read the script's source text first
scriptText = script.get_attribute('innerHTML')

# Get index of string 'fullChannel'
fullChannelTextIndex = scriptText.index('\'fullChannel\': ')

# Shorten script string by removing everything before 'fullChannel'
simplifiedScript = scriptText[fullChannelTextIndex:]

# Call split on 'canonicalUrl', which appears AFTER 'fullChannel'
# Then replace 'fullChannel' text and strip whitespace to get just the field value
fullChannelValue = simplifiedScript.split('\'canonicalUrl\': ')[0].replace('\'fullChannel\': ', '').replace(',', '').strip()

print(fullChannelValue)

This produces the output '/home/wirtschaft/international'
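
Since the question asks about regex, the same extraction can also be sketched with Python's re module. This is a minimal sketch, assuming the value is a quoted string literal with no embedded quotes; the function name extract_value and the shortened sample_script are made up for illustration:

```python
import re

QUOTE = "['\"]"  # matches either a single or a double quote


def extract_value(script_text, key):
    """Return the quoted string value assigned to `key` in the script text, or None."""
    pattern = QUOTE + re.escape(key) + QUOTE + r"\s*:\s*" + QUOTE + "([^'\"]*)" + QUOTE
    match = re.search(pattern, script_text)
    return match.group(1) if match else None


# Shortened stand-in for the text of the <script> element.
sample_script = """
dataLayer.push({
    'channel': 'international',
    'fullChannel': '/home/wirtschaft/international',
    'canonicalUrl': '',
});
"""
print(extract_value(sample_script, 'fullChannel'))
```

With Selenium, script_text would be the element's source, for example script.get_attribute('innerHTML'). Unlike the split on 'canonicalUrl', this does not depend on which key follows fullChannel, and it returns the value without the surrounding quotes.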


There are likely more efficient ways to do this than through Selenium, but I will leave my Selenium answer here in case you want to go this route.
