Lweb

HWweb

Sometimes you may be interested in processing data which is available in the form of a website consisting of multiple webpages (for example an e-shop with one page per item or a discussion forum with pages of individual users and individual discussion topics).

In this lecture, we will extract information from such a website using Python and existing Python libraries. We will store the results in an SQLite database. These results will be analyzed further in the following lectures.

Scraping webpages

In Python 2, the simplest tool for downloading webpages is the urllib2 library; in Python 3 the same functionality is available as urllib.request. Example usage (Python 3):

import urllib.request
f = urllib.request.urlopen('http://www.python.org/')
print(f.read())

You can also use the requests package (this is recommended):

import requests
r = requests.get("http://en.wikipedia.org")
print(r.text[:10])
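
Since the website of interest consists of many pages, a download loop is usually needed. The following is a minimal sketch rather than a definitive implementation: the function name download_pages, the one-second delay and the 10-second timeout are illustrative assumptions, and the get parameter exists only so that the function can be tried without network access.

```python
import time

import requests


def download_pages(urls, delay=1.0, get=requests.get):
    """Download each URL and return a dict mapping URL -> page text."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause between requests to avoid overloading the server
        r = get(url, timeout=10)
        r.raise_for_status()  # raise an exception on HTTP errors such as 404
        pages[url] = r.text
    return pages
```

Calling r.raise_for_status() makes a failed download stop the script immediately instead of silently storing an error page.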

Parsing webpages

When you download one page from a website, it is in HTML format and you need to extract useful information from it. We will use the beautifulsoup4 library for parsing HTML.

In your code, we recommend following the examples at the beginning of the beautifulsoup4 documentation and its example of CSS selectors. You can also check out the general syntax of CSS selectors.
The information you need to extract is located within the structure of the HTML document.
To find out how the document is structured, use the Inspect element feature in Chrome (right-click on the text of interest within the website). For example, this text is located within an LI element, which is within a UL element, which is in 3 nested DIV elements, one BODY element and one HTML element.
Based on this information, create a CSS selector.
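
The steps above can be sketched as follows. The HTML snippet, the id and class names, and the item texts are made up for illustration; they only mimic the nesting described above (LI within UL within 3 nested DIVs). The parser name "html.parser" selects the parser built into Python, so no extra package besides beautifulsoup4 is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical page with the structure described above:
# HTML > BODY > 3 nested DIVs > UL > LI
html = """
<html><body>
<div id="page"><div class="content"><div class="box">
<ul>
  <li>first item</li>
  <li>second item</li>
</ul>
</div></div></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# "a > b" matches b only if it is a direct child of a
items = soup.select("div > div > div > ul > li")
texts = [li.get_text(strip=True) for li in items]
print(texts)  # ['first item', 'second item']

# A shorter selector often suffices when class names are informative
same_items = soup.select("div.box li")
```

On a real page, a selector based on class or id names is usually more robust than one spelling out the whole nesting, because it keeps working when the site adds another wrapper element.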

Parsing dates

To parse dates written as text, you have two options:

the datetime.strptime function,
the dateutil package.
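
A short sketch comparing the two options; the date strings are invented examples, not taken from any particular website. With datetime.strptime the format has to be spelled out, while dateutil tries to guess it.

```python
from datetime import datetime

from dateutil import parser

# Option 1: the format is known in advance and is given explicitly
d1 = datetime.strptime("5.3.2015 14:30", "%d.%m.%Y %H:%M")

# Option 2: dateutil guesses the format from the text
d2 = parser.parse("March 5, 2015 14:30")

print(d1.isoformat())  # 2015-03-05T14:30:00
print(d1 == d2)        # True

# Ambiguous text needs a hint: by default 5/3/2015 is read as May 3,
# dayfirst=True makes it March 5
d3 = parser.parse("5/3/2015", dayfirst=True)
```

If all dates on the website share one format, strptime is the safer choice, since it raises an error on unexpected input instead of guessing.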

Other useful tips

Don't forget to commit to your SQLite3 database (call db.commit()).
The SQL command CREATE TABLE IF NOT EXISTS can be useful at the start of your script.
Use the screen command for long-running scripts.
All packages are installed on our server. If you use your own laptop, you need to install them using pip (preferably in a virtualenv).
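
The two database tips above can be combined as in the following sketch, which uses only the sqlite3 module from the standard library. The function name store_items, the table name items and its two columns are hypothetical; adapt them to the data you actually scrape.

```python
import sqlite3


def store_items(db_path, items):
    """Store (url, title) pairs in an SQLite database file."""
    db = sqlite3.connect(db_path)
    # Safe to run on every start: does nothing if the table already exists
    db.execute(
        "CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, title TEXT)"
    )
    # Placeholders (?) let sqlite3 handle quoting of the scraped text
    db.executemany("INSERT OR REPLACE INTO items VALUES (?, ?)", items)
    db.commit()  # without this, the inserted rows are lost when the script exits
    db.close()
```

Using the URL as the primary key together with INSERT OR REPLACE is one possible design: it allows the script to be rerun after a crash without creating duplicate rows.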
