I'm trying to scrape the href of the next-page link from an HTML tag using XPath with lxml, but the XPath returns an empty list even though it works fine when I test it separately. I've tried both a CSS selector and XPath, and both return an empty list.
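For reference, this is roughly how I tested the XPath on its own, against a locally saved copy of the results page (the file name here is just an example):
# standalone check of the XPath against a saved copy of the page
from lxml import html
with open('saved_amazon_page.html', encoding='utf-8') as f:
    tree = html.fromstring(f.read())
print(tree.xpath("//a[@id='pagnNextLink']/@href"))
That prints the expected next-page href. This is the full script I'm running: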
import sys
import time
import urllib.request
import random
from lxml import html
import lxml.html
import csv,os,json
import requests
from time import sleep
from lxml import etree
username = 'username'
password = 'password'
port = port
session_id = random.random()
super_proxy_url = ('http://%s-session-%s:%s@<super-proxy-host>:%d' % (username, session_id, password, port))
proxy_handler = urllib.request.ProxyHandler({
'http': super_proxy_url,
'https': super_proxy_url,})
opener = urllib.request.build_opener(proxy_handler)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')]
print('Performing request')
page = opener.open("https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588").read()
pageR = requests.get("https://www.amazon.com/s/ref=lp_3564986011_pg_2/133-0918882-0523213?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1%2Cn%3A173508%2Cn%3A266162%2Cn%3A3564986011&page=2&ie=UTF8&qid=1550294588",headers={"User-Agent":"Mozilla/5.0"})
doc = html.fromstring(pageR.text)
tree = lxml.html.fromstring(page)
links = tree.cssselect('#pagnNextLink')
for link in links:
    print(link.attrib['href'])
linkRef = doc.xpath("//a[@id='pagnNextLink']/@href")
print(linkRef)
for post in linkRef:
    link = "https://www.amazon.com%s" % post
I've tried two approaches here and neither of them works. I'm using a proxy server to access the links, and that part seems fine, since the doc variable is populated with the HTML content. I've also checked the links, and I'm on the correct page for this XPath/CSS selector.
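This is a rough sketch of how I'm checking what the request actually returns (the output file name is just an example I picked):
# dump the fetched body to disk and look for the next-page link id
with open('fetched_page.html', 'w', encoding='utf-8') as f:
    f.write(pageR.text)
print('pagnNextLink found:', 'pagnNextLink' in pageR.text)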
