1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Difference between revisions of "Lweb"
Revision as of 14:53, 24 March 2020
In this lecture, we will extract information from a website using Python and existing Python libraries. We will store the results in an SQLite database. These results will be analyzed further in the following lectures.
Scraping webpages
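In Python, the simplest tool for downloading webpages is the urllib.request module from the standard library (it was called urllib2 in Python 2). Example usage:
import urllib.request
# urlopen returns a file-like response object; read() returns the page as bytes
f = urllib.request.urlopen('http://www.python.org/')
print(f.read())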
You can also use the requests package (this is recommended):
import requests
r = requests.get("http://en.wikipedia.org")
print(r.text[:10])
Parsing webpages
We will use the beautifulsoup4 library for parsing HTML.
- In your code, we recommend following the examples at the beginning of the documentation and the example of CSS selectors. You can also check the general syntax of CSS selectors.
- The information you need to extract is located within the structure of the HTML document.
- To find out how the document is structured, use the Inspect element feature in Chrome (right-click the text of interest on the page).
- Based on this information, create a CSS selector
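The steps above can be sketched as follows. This is only an illustration: the HTML fragment, the class names articles and date, and the selector are made up for the example, and a real page will need a selector derived from its own structure.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded page.
html = """
<div id="content">
  <ul class="articles">
    <li><a href="/a1">First article</a> <span class="date">2020-03-24</span></li>
    <li><a href="/a2">Second article</a> <span class="date">2020-03-25</span></li>
  </ul>
</div>
"""

# "html.parser" is the parser bundled with Python, so nothing extra is needed.
soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <li> inside the <ul class="articles"> element.
articles = []
for item in soup.select("ul.articles li"):
    link = item.select_one("a")          # first <a> inside this <li>
    date = item.select_one("span.date")  # first <span class="date">
    articles.append((link["href"], link.get_text(), date.get_text()))

print(articles)
```

With a real page, the string passed to BeautifulSoup would be the downloaded text (for example r.text from requests) instead of the inline fragment.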
Parsing dates
To parse dates written as text, you have two options:
- datetime.strptime
- dateutil package.
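A minimal sketch of both options, assuming a made-up date string and an English-style month name. datetime.strptime needs an explicit format string, while dateutil tries to guess the format. Since dateutil is a third-party package, its import is guarded here.

```python
from datetime import datetime

# Hypothetical date string, as it might appear in scraped text.
raw = "24 March 2020, 14:53"

# Option 1: strptime requires a format string that matches the text exactly.
parsed = datetime.strptime(raw, "%d %B %Y, %H:%M")
print(parsed.isoformat())  # 2020-03-24T14:53:00

# Option 2: dateutil guesses the format on its own.
try:
    from dateutil import parser as date_parser
    guessed = date_parser.parse("March 24, 2020 14:53")
    print(guessed.isoformat())
except ImportError:
    guessed = None  # dateutil is not installed
```

strptime is stricter and fails loudly on unexpected input, which can be useful when all dates on a site share one format. dateutil is more convenient when formats vary.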
Other useful tips
- Don't forget to commit to your SQLite3 database (call db.commit()).
- The SQL command CREATE TABLE IF NOT EXISTS can be useful at the start of your script.
- Use the screen command for long-running scripts.
- All packages are installed on our server. If you use your own laptop, you need to install them using pip (preferably in a virtualenv).
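The first two tips can be sketched as follows. The table name pages, its columns, and the sample rows are hypothetical, and the in-memory database only keeps the example self-contained. A real script would pass a file name to sqlite3.connect.

```python
import sqlite3

# ":memory:" is used only for this sketch; use a file name in a real script.
db = sqlite3.connect(":memory:")

# Safe to run on every start: the table is created only if it is missing.
db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)")

# Hypothetical scraped rows; "?" placeholders let sqlite3 do the quoting.
rows = [("http://example.org/a", "Page A"), ("http://example.org/b", "Page B")]
db.executemany("INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)", rows)

# Make the inserts permanent; uncommitted changes are not saved to a file.
db.commit()

count = db.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
print(count)  # 2
```

Using INSERT OR REPLACE together with a primary key is one possible way to make the script safe to rerun without duplicating rows.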