Scraping HTML from URLs in csv then printing to csv with python

Question

I am trying to scrape a date from each of a series of URLs listed in a CSV file and then output the dates to a new CSV.

I have the basic Python code working, but I can't figure out how to load the URLs from the CSV (instead of pulling them from an array), scrape each URL, and then write the results to a new CSV. From reading a couple of posts I think I want to use Python's csv module, but I can't get it working.

Here is my code for the scraping part:

import urllib
import re

exampleurls =["http://www.domain1.com","http://www.domain2.com","http://www.domain3.com"]

i=0
while i<len(exampleurls):
    url = exampleurls[i]
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = 'on [0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]'
    pattern = re.compile(regex)
    date = re.findall(pattern,htmltext)
    print date
    i+=1

Any help is much appreciated!

Ok, but you need to import the module: import csv. Then you can try writing some code and post it here.
– maurelio79, Commented Jan 6, 2014 at 2:29
Yeah, I got that part, sorry if that wasn't clear. I didn't include it because I didn't include the csv code.
– NicoM, Commented Jan 6, 2014 at 2:56

fivetentaylor · Accepted Answer · 2014-01-06 19:10:30Z

1

If your csv looks like this:

"http://www.domain1.com","other column","yet another"
"http://www.domain2.com","other column","yet another"
...

Read the URL from the first column of each row like this:

import urllib
import csv

with open('urlFile.csv') as f:
    reader = csv.reader(f)

    for rec in reader:
        htmlfile = urllib.urlopen(rec[0])
        ...
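
As a slightly fuller sketch of the reading step, the loop above can be wrapped in a small helper that collects the first column into a list. This assumes Python 3 (hence `newline=''` when opening the file), and the name `read_urls` is only illustrative; it is not part of the csv module.

```python
import csv


def read_urls(path, column=0):
    """Return the values of one column from a CSV file, skipping blank rows."""
    urls = []
    # newline='' lets the csv module handle line endings itself (Python 3).
    with open(path, newline='') as f:
        for rec in csv.reader(f):
            # csv.reader yields an empty list for a blank line, so guard
            # against short rows before indexing.
            if len(rec) > column and rec[column].strip():
                urls.append(rec[column].strip())
    return urls
```

Calling `read_urls('urlFile.csv')` on the file shown above should give a list of the URLs from the first column, ready to loop over.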

And if your url file just looks like this:

http://www.domain1.com
http://www.domain2.com
...

You could read it with a list comprehension, stripping the trailing newline from each line:

urls = [line.strip() for line in open('urlFile')]

EDIT: reply to comment

You can either open an output file in Python and write each result to it:

out = open('myurls.csv', 'w')
...
for rec in reader:
    ...
    out.write(urlstring + '\n')  # the newline puts each record on its own line
out.close()
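
If the goal is to save the scraped dates rather than the URLs, one option is to let `csv.writer` do the formatting. The following is a minimal sketch assuming Python 3; `find_dates`, `write_dates`, and the `fetch` parameter are hypothetical names introduced here, and the regular expression is the one from the question.

```python
import csv
import re

# Same pattern as in the question: "on " followed by NN.NN.NN
DATE_PATTERN = re.compile(r'on [0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]')


def find_dates(htmltext):
    """Return every substring of htmltext that matches DATE_PATTERN."""
    return DATE_PATTERN.findall(htmltext)


def write_dates(urls, fetch, out_file):
    """Write one CSV row per URL: the URL followed by any dates found.

    fetch is any callable that takes a URL and returns the page text.
    out_file is a file-like object opened for writing text.
    """
    writer = csv.writer(out_file)
    for url in urls:
        writer.writerow([url] + find_dates(fetch(url)))
```

Passing `fetch` in as an argument keeps the download step separate from the CSV step. On Python 3 it could be something like `lambda url: urllib.request.urlopen(url).read().decode('utf-8', 'replace')`, and the output file would typically be opened with `open('dates.csv', 'w', newline='')`, where `dates.csv` is just a placeholder name.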

Or, if you're on Unix/Linux, just use print inside your code and redirect the output to a file in bash:

python your_scraping_script.py > someoutfile.csv
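
One caveat with the redirect approach: printing a list directly emits Python's list syntax rather than CSV. A possible workaround, sketched here for Python 3, is to point `csv.writer` at standard output so that each row is quoted properly. The helper name `print_row` is an assumption made for this example.

```python
import csv
import sys


def print_row(url, dates, stream=None):
    """Write one CSV-formatted row to stream (standard output by default)."""
    if stream is None:
        stream = sys.stdout
    # lineterminator='\n' overrides the writer's default '\r\n' line ending.
    writer = csv.writer(stream, lineterminator='\n')
    writer.writerow([url] + list(dates))
```

Calling `print_row(url, dates)` inside the scraping loop and then running the script with the redirect shown above should leave one CSV row per URL in the output file.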

edited Jan 6, 2014 at 19:10

answered Jan 6, 2014 at 6:01


1 Comment

NicoM Over a year ago

Thanks a lot fivetentaylor! That worked! Do you know how I would then save the dates to a CSV file instead of printing them?
