1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Revision as of 11:18, 18 March 2020

HWweb

In this lecture, we will extract information from a website using Python and existing Python libraries. We will store the results in an SQLite database. These results will be analyzed further in the following lectures.

Scraping webpages

In Python, the simplest tool for scraping webpages is the urllib.request module from the standard library. Example usage:

import urllib.request
f = urllib.request.urlopen('http://www.python.org/')
print(f.read())

You can also use the requests package:

import requests
r = requests.get("http://en.wikipedia.org")
print(r.text[:10])

Parsing webpages

We will use beautifulsoup4 library for parsing HTML.

  • In your code, we recommend following the examples at the beginning of the documentation and the example of CSS selectors. You can also check out the general syntax of CSS selectors.
  • The information you need to extract is located within the structure of the HTML document.
  • To find out how the document is structured, use the Inspect element feature in Chrome (right-click on the text of interest within the website).
  • Based on this information, create a CSS selector.
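The steps above can be sketched as follows; the HTML fragment and the class names in it are made up for illustration, so replace the selector with one matching the page you actually inspect:

```python
from bs4 import BeautifulSoup

# A small HTML fragment standing in for a downloaded page.
html = """
<div class="entry"><span class="name">Alice</span> <span class="year">2020</span></div>
<div class="entry"><span class="name">Bob</span> <span class="year">2021</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching elements.
names = [tag.get_text() for tag in soup.select("div.entry span.name")]
print(names)  # ['Alice', 'Bob']
```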

Parsing dates

To parse dates (written as text), you have two options:
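For illustration, one common option is the standard library's datetime.strptime with an explicit format string; the date string and format below are example assumptions, not values from the lecture:

```python
from datetime import datetime

# strptime parses a date written as text according to a format string;
# "%d %B %Y" matches dates like "18 March 2020".
d = datetime.strptime("18 March 2020", "%d %B %Y")
print(d.year, d.month, d.day)  # 2020 3 18
```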

Other useful tips

  • Don't forget to commit to your SQLite3 database (call db.commit()).
  • The SQL command CREATE TABLE IF NOT EXISTS can be useful at the start of your script.
  • Use the screen command for long-running scripts.
  • All packages are installed on our server. If you use your own laptop, you need to install them using pip (preferably in a virtualenv).
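The SQLite-related tips above can be sketched as follows; the table name and columns are hypothetical, and an in-memory database stands in for the file your script would use:

```python
import sqlite3

# A throwaway in-memory database; in your script you would pass a filename.
db = sqlite3.connect(":memory:")

# Safe to run on every start of the script: creates the table only once.
db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")
db.execute("INSERT INTO pages VALUES (?, ?)", ("http://example.com", "Example"))

# Without commit(), changes to a file-backed database may be lost.
db.commit()

rows = db.execute("SELECT title FROM pages").fetchall()
print(rows)  # [('Example',)]
```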