
I'm trying to scrape the href of the next-page link from an HTML tag using XPath with lxml, but the XPath returns an empty list, even though the same expression works when I test it separately.

I've tried both a CSS selector and an XPath expression; both return an empty list, while the XPath itself seems to be fine.

import random
import urllib.request

import requests
from lxml import html

username = 'username'
password = 'password'
port = 12345  # placeholder: your proxy port
session_id = random.random()
super_proxy_url = ('http://%s-session-%s:%[email protected]:%d'
                   % (username, session_id, password, port))
proxy_handler = urllib.request.ProxyHandler({
    'http': super_proxy_url,
    'https': super_proxy_url,
})
opener = urllib.request.build_opener(proxy_handler)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')]
print('Performing request')
print('Performing request')

page = opener.open("https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588").read()
pageR = requests.get("https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588",headers={"User-Agent":"Mozilla/5.0"})

doc = html.fromstring(pageR.text)  # parse the response body, not str(response)

docU = html.fromstring(page)  # don't name this `html`; that shadows the lxml import
links = docU.cssselect('#pagnNextLink')
for link in links:
    print(link.attrib['href'])

linkRef = doc.xpath("//a[@id='pagnNextLink']/@href")
print(linkRef)
for post in linkRef:
    link = "https://www.amazon.com%s" % post

I've tried two approaches here and neither of them works.

I'm using a proxy server to access the links, and it seems to work: the "doc" variable is populated with the HTML content. I've checked the links and I'm on the right page for this XPath/CSS selector to match.
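One way to separate "selector is wrong" from "page is wrong" is to run the same XPath against a stub of the expected markup offline; the snippet below is an assumed reconstruction of Amazon's pagination markup at the time, not the real page:

```python
from lxml import html

# Assumed stub of the pagination markup the selector targets.
snippet = '<div><a id="pagnNextLink" href="/s?page=3">Next</a></div>'
doc = html.fromstring(snippet)

# The same XPath used in the question, run against the stub.
print(doc.xpath("//a[@id='pagnNextLink']/@href"))  # ['/s?page=3']
```

If this prints the href but the live fetch returns an empty list, the selector is fine and the fetched document simply doesn't contain that element.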

[Screenshot: XPath and CSS validation]

  • Can you show the top part of the code as well? Commented Feb 16, 2019 at 11:16
  • updated.. @QHarr Commented Feb 16, 2019 at 11:30
  • can you please give me any suggestion? I'm stuck here :( @QHarr Commented Feb 16, 2019 at 11:33
  • The proxy will mask the mentioned issues, but it's also not responding, which is what concerns me. Commented Feb 16, 2019 at 11:34
  • One more thing: I checked the "title" using XPath, and it shows the expected output. Commented Feb 16, 2019 at 11:36

1 Answer

Someone more experienced may give better advice on working with your set-up, so I will simply describe what I experienced:

When I used requests, I sometimes got the link and sometimes not. When not, the response indicated the site was checking I wasn't a bot and asked me to ensure my browser allowed cookies.
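You can make that failure mode visible by checking the response body for the interstitial before parsing. The marker strings below are assumptions based on what Amazon's bot-check page typically contained; adjust them to whatever text your blocked responses actually show:

```python
def looks_like_bot_check(body: str) -> bool:
    """Heuristic: True if the body looks like Amazon's bot-check page.

    The marker strings are assumptions -- tune them to the blocked
    responses you actually receive."""
    markers = ('Robot Check', 'Enter the characters you see below')
    return any(m in body for m in markers)

# Example: a blocked interstitial vs. a normal results page.
blocked = '<title>Robot Check</title><p>Enter the characters you see below</p>'
normal = '<a id="pagnNextLink" href="/s?page=3">Next</a>'
print(looks_like_bot_check(blocked), looks_like_bot_check(normal))  # True False
```

In the scraper, a True result would gate a retry with a fresh proxy session instead of parsing the blocked page.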

With selenium I reliably got a result in my tests, though this may not be quick enough, or an option for you for other reasons.

from selenium import webdriver
d = webdriver.Chrome()
url = 'https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588'
d.get(url)
link = d.find_element_by_id('pagnNextLink').get_attribute('href')
print(link)

Selenium with proxy (Firefox):

Running Selenium Webdriver with a proxy in Python

Selenium with proxy (Chrome) - covered nicely here:

https://stackoverflow.com/a/11821751/6241235


2 Comments

Thanks for the selenium code, but as of now I need this to be done with requests itself.
Understood. Hopefully someone with more Python experience will post :-)
