
I'm trying to scrape the href of the next-page link from an HTML tag using XPath with lxml, but the XPath returns an empty list, even though the same expression works when I test it separately.

I've tried both a CSS selector and an XPath expression; both return an empty list, while the XPath itself seems to be fine.

import random
import urllib.request

import requests
from lxml import html

username = 'username'
password = 'password'
port = 12345  # placeholder: your proxy port
session_id = random.random()
super_proxy_url = ('http://%s-session-%s:%[email protected]:%d'
                   % (username, session_id, password, port))
proxy_handler = urllib.request.ProxyHandler({
    'http': super_proxy_url,
    'https': super_proxy_url,
})
opener = urllib.request.build_opener(proxy_handler)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')]
print('Performing request')
print('Performing request')

page = opener.open("https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588").read()
pageR = requests.get("https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588",headers={"User-Agent":"Mozilla/5.0"})

doc = html.fromstring(pageR.text)  # parse the response body, not str(response)

docU = html.fromstring(page)  # don't name this `html`; that shadows the lxml import
links = docU.cssselect('#pagnNextLink')
for link in links:
    print(link.attrib['href'])

linkRef = doc.xpath("//a[@id='pagnNextLink']/@href")
print(linkRef)
for post in linkRef:
    link = "https://www.amazon.com%s" % post

I've tried two approaches here and neither of them works.

I'm using a proxy server to access the links, and it seems to work: the "doc" variable is populated with the HTML content. I've checked the links and I'm on the right page for this XPath/CSS selector to match.
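One way to separate "selector is wrong" from "page is wrong" is to run the same XPath against a stub of the expected markup offline; the snippet below is an assumed reconstruction of Amazon's pagination markup at the time, not the real page:

```python
from lxml import html

# Assumed stub of the pagination markup the selector targets.
snippet = '<div><a id="pagnNextLink" href="/s?page=3">Next</a></div>'
doc = html.fromstring(snippet)

# The same XPath used in the question, run against the stub.
print(doc.xpath("//a[@id='pagnNextLink']/@href"))  # ['/s?page=3']
```

If this prints the href but the live fetch returns an empty list, the selector is fine and the fetched document simply doesn't contain that element.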

[Screenshot: XPath and CSS validation]

  • Can you show the top part of the code as well? Commented Feb 16, 2019 at 11:16
  • updated.. @QHarr Commented Feb 16, 2019 at 11:30
  • can you please give me any suggestion? I'm stuck here :( @QHarr Commented Feb 16, 2019 at 11:33
  • The proxy will mask the mentioned issues, but it's also not responding, which is what concerns me. Commented Feb 16, 2019 at 11:34
  • One more thing: I checked the "title" using XPath, and it shows the expected output. Commented Feb 16, 2019 at 11:36

1 Answer

Someone more experienced may give better advice on working with your set-up, so I will simply describe what I experienced:

When I used requests, I sometimes got the link and sometimes not. When not, the response indicated the site was checking I wasn't a bot and asked me to ensure my browser allowed cookies.
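You can make that failure mode visible by checking the response body for the interstitial before parsing. The marker strings below are assumptions based on what Amazon's bot-check page typically contained; adjust them to whatever text your blocked responses actually show:

```python
def looks_like_bot_check(body: str) -> bool:
    """Heuristic: True if the body looks like Amazon's bot-check page.

    The marker strings are assumptions -- tune them to the blocked
    responses you actually receive."""
    markers = ('Robot Check', 'Enter the characters you see below')
    return any(m in body for m in markers)

# Example: a blocked interstitial vs. a normal results page.
blocked = '<title>Robot Check</title><p>Enter the characters you see below</p>'
normal = '<a id="pagnNextLink" href="/s?page=3">Next</a>'
print(looks_like_bot_check(blocked), looks_like_bot_check(normal))  # True False
```

In the scraper, a True result would gate a retry with a fresh proxy session instead of parsing the blocked page.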

With selenium I reliably got a result in my tests, though this may not be quick enough, or an option for you for other reasons.

from selenium import webdriver
d = webdriver.Chrome()
url = 'https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588'
d.get(url)
link = d.find_element_by_id('pagnNextLink').get_attribute('href')
print(link)

Selenium with proxy (Firefox):

Running Selenium Webdriver with a proxy in Python

Selenium with proxy (Chrome) - covered nicely here:

https://stackoverflow.com/a/11821751/6241235


2 Comments

Thanks for the selenium code, but as of now I need this to be done with requests itself.
Understood. Hopefully someone with more Python experience will post :-)
