
All,

I've just started using Python (v 2.7.1) and one of my first programs is trying to scrape information from a website containing power station data using the Standard Library and BeautifulSoup to handle the HTML elements.

The data I'd like to access is available either in the 'head' section of the HTML or as tables within the main body. The website will generate a CSV file from its data if the CSV link is clicked.

Using a couple of sources on this website I've managed to cobble together the code below, which pulls the data out and saves it to a file, but the output still contains the \n designators. Try as I might, I can't get a correct CSV file to save out.

I am sure it's something simple but need a bit of help if possible!

from BeautifulSoup import BeautifulSoup

import urllib2,string,csv,sys,os
from string import replace

bm_url = 'http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=T_COTPS-4&param2=&param3=&param4=&param5=2011-02-05&param6=*'

data = urllib2.urlopen(bm_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('head',limit=1))

data = replace(data,'[<head>','')
data = replace(data,'<script language="JavaScript" src="/bwx_generic.js"></script>','')
data = replace(data,'<link rel="stylesheet" type="text/css" href="/bwx_style.css" />','')
data = replace(data,'<title>Historic Physical Balancing Mechanism Data</title>','')
data = replace(data,'<script language="JavaScript">','')
data = replace(data,' </script>','')
data = replace(data,'</head>]','')
data = replace(data,'var gs_csv=','')
data = replace(data,'"','')
data = replace(data,"'",'')
data = data.strip()

file_location = 'c:/temp/'
file_name = file_location + 'DataExtract.txt'

file = open(file_name,"wb")
file.write(data)
file.close()
  • A copy of the HTML file or a link to the site would help. Otherwise it's guessing in the dark :( Commented Feb 6, 2011 at 17:34

2 Answers


Don't turn it back into a string and then use replace. That completely defeats the point of using BeautifulSoup!

Try starting like this:

scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]

Then you can use:

  1. partition('=')[2] to cut off the "var gs_csv" bit.
  2. strip(' \n"') to remove unwanted characters at each end (space, newline, ")
  3. replace("\\n","\n") to sort out the new lines.
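Chained together on a made-up sample (the string below is a stand-in for the script contents, not the real feed), those three steps look like:

```python
# Hypothetical sample of what scripttag.contents[0] might hold
javascriptdata = ' var gs_csv="HDR,PHYSICAL BM DATA,20110205,*\\nPN,T_COTPS-4,1"\n'

data = javascriptdata.partition('=')[2]  # 1. cut off the 'var gs_csv' bit
data = data.strip(' \n"')                # 2. trim space, newline, " at each end
data = data.replace("\\n", "\n")         # 3. turn literal \n into real new lines
```

After the three steps, data is a plain multi-line string ready for splitting.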

Incidentally, replace is a string method, so you don't have to import it separately, you can just do data.replace(....

Finally, you need to parse it as CSV. You could save it to a file, reopen it, and load it into a csv.reader. Or you could use the StringIO module to turn the string into something you can feed directly to csv.reader (i.e. without saving a file first). But I think this data is simple enough that you can get away with doing:

for line in data.splitlines():
    row = line.split(",")
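If you'd rather let the csv module do the splitting (it handles quoted fields, which a plain split(",") doesn't), the StringIO route sketched above looks roughly like this — note that io.StringIO here is the Python 3 spelling; on 2.7 you'd use the StringIO (or cStringIO) module instead:

```python
import csv
import io  # Python 2.7: from StringIO import StringIO

# Stand-in sample of the cleaned-up data, not the real feed
data = "HDR,PHYSICAL BM DATA,20110205,*\nPN,T_COTPS-4,1,20110205000000,490.000"

# Wrap the string in a file-like object and hand it to csv.reader
rows = list(csv.reader(io.StringIO(data)))
```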

6 Comments

@Thomas - Thanks, that would make sense! I've only been Pythoning for a week or so - been using VBA for years - so I'm still trying to find my feet, and a little drunk on all the things it can do compared to the world I'm used to.
@Patrick: You're welcome. Don't worry, before long you'll start to see the "one obvious way to do it". If you've not already come across that phrase, type import this.
@Patrick: P.S. A note on using stackoverflow: If this answer has helped you, you can "accept" it by clicking the tick to the left (You can only accept one answer per question).
@Thomas: Liking 'import this' ;-) The best way to learn is always to try something first and then ask for a couple of pointers; there's little to be gained by someone providing you with the answer straightaway.
@Thomas: Sorry to bug you once again. I've tried to extract the data I want but I am still ending up with " var gs_csv="HDR,PHYSICAL BM DATA,20110205,*\nPN,T_COTPS-4,1,20110205000000,490.000,20110205003000,490.000\nPN" rather than a new line each time \n

SOLUTION

from BeautifulSoup import BeautifulSoup
import urllib2,string,csv,sys,os,time

bm_url_stem = "http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1="
bm_station = "T_COTPS-3"
bm_param = "&param2=&param3=&param4=&param5="
bm_date = "2011-02-04"
bm_param6 = "&param6=*"

bm_full_url = bm_url_stem + bm_station + bm_param + bm_date + bm_param6

data = urllib2.urlopen(bm_full_url).read()
soup = BeautifulSoup(data)
scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]
javascriptdata = javascriptdata.partition('=')[2]
javascriptdata = javascriptdata.strip(' \n"')
javascriptdata = javascriptdata.replace("\\n","\n")
javascriptdata = javascriptdata.strip()

csvwriter = csv.writer(file("c:/temp/" + bm_station + "_" + bm_date + ".csv", "wb"))

for line in javascriptdata.splitlines():
    row = line.split(",")
    csvwriter.writerow(row)

del csvwriter
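For reference, on current Pythons the write-out step is usually wrapped in a with block, which closes the file deterministically (the del csvwriter above leaves the underlying file object to be closed by garbage collection). A sketch of the same loop, with a made-up sample in place of the scraped data:

```python
import csv

# Stand-in sample of the cleaned-up javascriptdata, not the real feed
javascriptdata = "HDR,PHYSICAL BM DATA,20110205,*\nPN,T_COTPS-4,1"

# Python 2.7 would use open(..., "wb"); newline="" is the Python 3 idiom
with open("DataExtract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for line in javascriptdata.splitlines():
        writer.writerow(line.split(","))
```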

2 Comments

Glad you worked it out. If you just want to save it, you don't need to separate it up and use a csv.writer. You can just do open("filename.csv","w").write(javascriptdata).
That's even tidier! Thanks once again...glad I signed up for this website...reminds me a lot of the old newsgroups with the friendly help
