1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Lsql
This lecture introduces the basics of the Python programming language. We will also cover basics of working with databases using the SQL language and SQLite3 lightweight database system.
The next three lectures
- Computer science students will use Python, SQLite3 and several advanced Python libraries for complex data processing
- Bioinformatics students will use several bioinformatics command-line tools
Contents
Overview, documentation
Python
- Popular programming language
- Advantages: powerful language features, extensive libraries
- Disadvantages: interpreted language, can be slow
- A very concise cheat sheet: [1]
- A more detailed tutorial: [2]
SQL
- Language for working with relational databases, more in a dedicated course
- We will cover basics of SQL and work with a simple DB system SQLite3
- Typical database systems are complex, use server-client architecture. SQLite3 is a simple "database" stored in one file. You can think of SQLite3 not as a replacement for Oracle but as a replacement for fopen().
- [3] SQLite3 documentation]
- SQL tutorial
- [SQLite3 in Python documentation
Outline of this lecture:
- We introduce a simple data set
- We look at several Python scripts for processing this data set
- Solve task A, where you create another such script
- We introduce basics of working with SQLite3 and writing SQL queries
- Solve tasks B1 and B2, where you write your own SQL queries
- We look at how to combine Python and SQLite
- Solve task C, where you write a program combining the two
- Students familiar with both Python and SQL may skip tasks A, B1, B2 and and do tasks C and D
Dataset for this lecture
- IMDb is an online database of movies and TV series with user ratings
- We have downloaded a preprocessed dataset of selected TV series ratings from GitHub
- From this dataset, we have selected 10 series with high average number of voting users
- Data are two files in csv format: list of series, list of episodes
File series.cvs contains one row per series
- Columns: (0) series id, (1) series title, (2) TV channel:
3,Breaking Bad,AMC 2,Sherlock,BBC 1,Game of Thrones,HBO
File episodes.csv contains one row per episode:
- Columns: (0) series id, (1) episode title, (2) episode order within the whole series, (3) season number, (4) episode number within season, (5) user rating, (6) the number of votes
- Here is a sample of 4 episodes from the Game of Thrones series
- If the episode title contains a comma, the whole title is in quotation marks
1,"Dark Wings, Dark Words",22,3,2,8.6,12714 1,No One,58,6,8,8.3,20709 1,Battle of the Bastards,59,6,9,9.9,138353 1,The Winds of Winter,60,6,10,9.9,93680
Several Python scripts
We will illustrate basic features of Python on several scripts working with these files.
prog1.py
The first script prints the second column (series title) from series.csv
#! /usr/bin/python3
# open a file for reading
with open('series.csv') as csvfile:
# iterate over lines of the input file
for line in csvfile:
# split a line into columns at commas
columns = line.split(",")
# print the second column
print(columns[1])
- Python uses indentation to delimit blocks. In this example, the with command starts a block and within this block the for command starts another block containing commands columns=... and print. The body of each block is indented several spaces relative to the enclosing block.
- Variables are not declared, but directly used. This program uses variables csvfile, line, columns.
- The open command opens a file (here for reading, but other options are available).
- The with command opens the file, stores the file handle in csvfile variable, executes all commands within its block and finally closes the file (read more)
- The for-loop iterates over all lines in the file, assigning each in variable line and executing the body of the block.
- Method split of the built-in string type str splits the line at every comma and returns a list of strings, one for every column of the table (see also other string methods)
- The final line prints the second column and the end of line character.
prog2.py
The following script prints the list of series of each TV channel
- For illustration we also separately count number of the series for each channel, but the count could be also obtained as the length of the list
#! /usr/bin/python3
from collections import defaultdict
# Create a dictionary in which default value
# for non-existent key is 0 (type int)
# For each channel we will count the series
channel_counts = defaultdict(int)
# Create a dictionary for keeping a list of series per channel
# default value empty list
channel_lists = defaultdict(list)
# open a file and iterate over lines
with open('series.csv') as csvfile:
for line in csvfile:
# strip whitespace (e.g. end of line) from end of line
line = line.rstrip()
# split line into columns, find channel and series names
columns = line.split(",")
channel = columns[2]
series = columns[1]
# increase counter for channel
channel_counts[channel] += 1
# add series to list for the channel
channel_lists[channel].append(series)
# print counts
print("Counts:")
for (channel, count) in channel_counts.items():
print("The number of series for channel \"%s\" is %d"
% (channel, count))
# print series lists
print("\nLists:")
for channel in channel_lists:
list = ", ".join(channel_lists[channel])
print("Series for channel \"%s\": %s" % (channel,list))
- In this script, we use two dictionaries (maps, associative arrays), both having channel names as keys. Dictionary channel_counts stores the number of series, channel_lists stores the list of series names.
- For simplicity we use a library data structure called defaultdict instead of a plain python dictionary. The reason is easier initialization: keys do not need to be explicitly inserted to the dictionary, but are initialized with a default value at first access.
- Reading of the input file is similar to the previous script
- Afterwards we iterate through the keys of both dictionaries and print both the keys and the values
- We format the output string using the % operator to replace %s and %d with values channel and count.
- Notice that when we print counts, we iterate through pairs (channel, count) returned by channel_counts.items() while when we print series, we iterate through keys of the dictionary
prog3.py
This script finds the episode with the highest number of votes among all episodes
- We use a library for csv parsing to deal with quotation marks around episode names with commas, such as "Dark Wings, Dark Words"
- This is done by first opening a file and then passing it as an argument to csv.reader, which returns a reader object used to iterate over rows.
#! /usr/bin/python3
import csv
# keep maximum number of votes and its episode
max_votes = 0
max_votes_episode = None
# open a file
with open('episodes.csv') as csvfile:
# create a reader for parsing csv files
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
# iterate over rows already split into columns
for row in reader:
votes = int(row[6])
if votes > max_votes:
max_votes = votes
max_votes_episode = row[1]
# print result
print("Maximum votes %d in episode \"%s\"" % (max_votes, max_votes_episode))
prog4.py
The following script shows an example of function definition
- The function reads a whole csv file into a 2d array
- The rest of the program calls this function twice for each of the two files
- This could be followed by some further processing of these 2d arrays
#! /usr/bin/python3
import csv
def read_csv_to_list(filename):
# create empty list
rows = []
# open a file
with open(filename) as csvfile:
# create a reader for parsing csv files
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
# iterate over rows already split into columns
for row in reader:
rows.append(row)
return rows
series = read_csv_to_list('series.csv')
episodes = read_csv_to_list('episodes.csv')
print("the number of episodes is %d" % len(episodes))
# further processing of series and episodes...
Now do HWpython, task A
SQL and SQLite
Creating a database
SQLite3 database is a file with your data stored in some special format. To load our csv file to a SQLite database, run command:
sqlite3 series.db < create_db.sql
Contents of create_db.sql file:
CREATE TABLE series (
id INT,
title TEXT,
channel TEXT
);
.mode csv
.import series.csv series
CREATE TABLE episodes (
seriesId INT,
title TEXT,
orderInSeries INT,
season INT,
orderInSeason INT,
rating REAL,
votes INT
);
.mode csv
.import episodes.csv episodes
- The two CREATE TABLE commands create two tables named series and episodes
- For each column (attribute) of the table we list its name and type.
- Commands starting with a dot are special SQLite3 commands, not part of SQL itself. Command .import reads the txt file and stores it in a table.
Other useful SQLite3 commands;
- .schema tableName (lists columns of a given table)
- .mode column and .headers on (turn on human-friendly formatting (not good for further processing)
SQL queries
- Run sqlite3 series.db to get an SQLite command-line where you can interact with your database
- Then type the following queries which illustrate the basic features of SQL
/* switch on human-friendly formatting */
.mode column
.headers on
/* print title of each series (as prog1.py) */
SELECT title FROM series;
/* sort titles alphabetically */
SELECT title FROM series ORDER BY title;
/* find the highest vote number among episodes */
SELECT MAX(votes) FROM episodes;
/* find the episode with the highest number of votes, as prog3.py */
SELECT title, votes FROM episodes
ORDER BY votes DESC LIMIT 1;
/* print all episodes with at least 50k votes, order by votes */
SELECT title, votes FROM episodes
WHERE votes>50000 ORDER BY votes desc;
/* join series and episodes tables, print 10 episodes
* with the highest number of votes */
SELECT s.title, e.title, votes
FROM episodes AS e, series AS s
WHERE e.seriesId=s.id
ORDER BY votes desc limit 10;
/* compute the number of series per channel, as prog2.py */
SELECT channel, COUNT() as series_count
FROM series GROUP BY channel;
/* print the number of episodes and average rating per season and series */
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
FROM episodes GROUP BY seriesId, season;
Now do #HWpython, tasks B1, B2
Accessing a database from Python
We will use sqlite3 library for Python to access data from the database and to process in further in the Python program.
read_db.py
- The following script illustrates running a SELECT query and the getting results
#! /usr/bin/python3
import sqlite3
# connect to a database
connection = sqlite3.connect('series.db')
# create a "cursor" for working with the database
cursor = connection.cursor()
# run a select query
# supply parameters of the query using placeholders ?
threshold = 40000
cursor.execute("""SELECT title, votes FROM episodes
WHERE votes>? ORDER BY votes desc""", (threshold,))
# retrieve results of the query
for row in cursor:
print("Episode \"%s\" votes %s" % (row[0],row[1]))
# close db connection
connection.close()
write_db.py
This script illustrates creating a new database containing a multiplication table
#! /usr/bin/python3
import sqlite3
# connect to a database
connection = sqlite3.connect('multiplication.db')
# create a "cursor" for working with the database
cursor = connection.cursor()
cursor.execute("""
CREATE TABLE mult_table (
a INT, b INT, mult INT)
""")
for a in range(1,11):
for b in range(1,11):
cursor.execute("INSERT INTO mult_table (a,b,mult) VALUES (?,?,?)",
(a,b,a*b))
# important: save the changes
connection.commit()
# close db connection
connection.close()
We can check the result by running command
sqlite3 multiplication.db "SELECT * FROM mult_table;"
Now do #HWpython, task C