
All,

I've just started using Python (v 2.7.1) and one of my first programs is trying to scrape information from a website containing power station data using the Standard Library and BeautifulSoup to handle the HTML elements.

The data I'd like to access is available either in the 'head' section of the HTML or as tables within the main body. The website will generate a CSV file from its data if the CSV link is clicked.

Using a couple of sources on this website I've managed to cobble together the code below, which pulls the data out and saves it to a file, but the output still contains the \n designators. Try as I might, I can't get a correct CSV file to save out.

I am sure it's something simple but need a bit of help if possible!

from BeautifulSoup import BeautifulSoup

import urllib2,string,csv,sys,os
from string import replace

bm_url = 'http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=T_COTPS-4&param2=&param3=&param4=&param5=2011-02-05&param6=*'

data = urllib2.urlopen(bm_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('head',limit=1))

data = replace(data,'[<head>','')
data = replace(data,'<script language="JavaScript" src="/bwx_generic.js"></script>','')
data = replace(data,'<link rel="stylesheet" type="text/css" href="/bwx_style.css" />','')
data = replace(data,'<title>Historic Physical Balancing Mechanism Data</title>','')
data = replace(data,'<script language="JavaScript">','')
data = replace(data,' </script>','')
data = replace(data,'</head>]','')
data = replace(data,'var gs_csv=','')
data = replace(data,'"','')
data = replace(data,"'",'')
data = data.strip()

file_location = 'c:/temp/'
file_name = file_location + 'DataExtract.txt'

file = open(file_name,"wb")
file.write(data)
file.close()
  • A copy of the HTML file or a link to the site would help. Otherwise it's guessing in the dark :( Commented Feb 6, 2011 at 17:34

2 Answers


Don't turn it back into a string and then use replace. That completely defeats the point of using BeautifulSoup!

Try starting like this:

scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]

Then you can use:

  1. partition('=')[2] to cut off the "var gs_csv" bit.
  2. strip(' \n"') to remove unwanted characters at each end (space, newline, ")
  3. replace("\\n","\n") to sort out the new lines.
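Chained together on a made-up sample (the string below is a stand-in for the script contents, not the real feed), those three steps look like:

```python
# Hypothetical sample of what scripttag.contents[0] might hold
javascriptdata = ' var gs_csv="HDR,PHYSICAL BM DATA,20110205,*\\nPN,T_COTPS-4,1"\n'

data = javascriptdata.partition('=')[2]  # 1. cut off the 'var gs_csv' bit
data = data.strip(' \n"')                # 2. trim space, newline, " at each end
data = data.replace("\\n", "\n")         # 3. turn literal \n into real new lines
```

After the three steps, data is a plain multi-line string ready for splitting.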

Incidentally, replace is a string method, so you don't have to import it separately, you can just do data.replace(....

Finally, you need to parse it as CSV. You could save it to a file, reopen it, and load it into a csv.reader. Or you could use the StringIO module to turn the string into something you can feed directly to csv.reader (i.e. without saving a file first). But I think this data is simple enough that you can get away with doing:

for line in data.splitlines():
    row = line.split(",")
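If you'd rather let the csv module do the splitting (it handles quoted fields, which a plain split(",") doesn't), the StringIO route sketched above looks roughly like this — note that io.StringIO here is the Python 3 spelling; on 2.7 you'd use the StringIO (or cStringIO) module instead:

```python
import csv
import io  # Python 2.7: from StringIO import StringIO

# Stand-in sample of the cleaned-up data, not the real feed
data = "HDR,PHYSICAL BM DATA,20110205,*\nPN,T_COTPS-4,1,20110205000000,490.000"

# Wrap the string in a file-like object and hand it to csv.reader
rows = list(csv.reader(io.StringIO(data)))
```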

6 Comments

@Thomas - Thanks, that would make sense! I've only been Pythoning for a week or so - been using VBA for years - so I'm still trying to find my feet, and a little drunk on all the things it can do compared to the world I'm used to.
@Patrick: You're welcome. Don't worry, before long you'll start to see the "one obvious way to do it". If you've not already come across that phrase, type import this.
@Patrick: P.S. A note on using stackoverflow: If this answer has helped you, you can "accept" it by clicking the tick to the left (You can only accept one answer per question).
@Thomas: Liking 'import this' ;-) The best way to learn is always to try something first and then ask for a couple of pointers; there's little to be gained by someone providing you with the answer straightaway.
@Thomas: Sorry to bug you once again. I've tried to extract the data I want but I am still ending up with " var gs_csv="HDR,PHYSICAL BM DATA,20110205,*\nPN,T_COTPS-4,1,20110205000000,490.000,20110205003000,490.000\nPN" rather than a new line each time \n

SOLUTION

from BeautifulSoup import BeautifulSoup
import urllib2,string,csv,sys,os,time

bm_url_stem = "http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1="
bm_station = "T_COTPS-3"
bm_param = "&param2=&param3=&param4=&param5="
bm_date = "2011-02-04"
bm_param6 = "&param6=*"

bm_full_url = bm_url_stem + bm_station + bm_param + bm_date + bm_param6

data = urllib2.urlopen(bm_full_url).read()
soup = BeautifulSoup(data)
scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]
javascriptdata = javascriptdata.partition('=')[2]
javascriptdata = javascriptdata.strip(' \n"')
javascriptdata = javascriptdata.replace("\\n","\n")
javascriptdata = javascriptdata.strip()

csvwriter = csv.writer(file("c:/temp/" + bm_station + "_" + bm_date + ".csv", "wb"))

for line in javascriptdata.splitlines():
    row = line.split(",")
    csvwriter.writerow(row)

del csvwriter
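For reference, on current Pythons the write-out step is usually wrapped in a with block, which closes the file deterministically (the del csvwriter above leaves the underlying file object to be closed by garbage collection). A sketch of the same loop, with a made-up sample in place of the scraped data:

```python
import csv

# Stand-in sample of the cleaned-up javascriptdata, not the real feed
javascriptdata = "HDR,PHYSICAL BM DATA,20110205,*\nPN,T_COTPS-4,1"

# Python 2.7 would use open(..., "wb"); newline="" is the Python 3 idiom
with open("DataExtract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for line in javascriptdata.splitlines():
        writer.writerow(line.split(","))
```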

2 Comments

Glad you worked it out. If you just want to save it, you don't need to separate it up and use a csv.writer. You can just do open("filename.csv","w").write(javascriptdata).
That's even tidier! Thanks once again...glad I signed up for this website...reminds me a lot of the old newsgroups with the friendly help
