2-INF-185 Integrácia dátových zdrojov 2017/18

Points for already graded assignments can be found on the server in /grades/userid.txt
Project submission deadlines:
1st deadline: Sunday 4.6. 22:00
2nd deadline: Wednesday 20.6. 22:00
Both deadlines are regular; the first is intended for students who want to complete the course earlier. In both cases, short individual meetings with the instructors (a discussion of the project and finalizing the grade) will take place a few days after submission. We will agree on the exact days and times later. Submit projects the same way as homework, to /submit/projekt




  • Program for today: basics of Python and SQL, plus a bonus homework worth 50% of a regular HW.
  • In the next three lectures (after Easter), you will use Python, SQLite3, and several advanced Python libraries for complex data processing.

Overview, documentation

Python: good sources for beginners:

  • A very concise cheat sheet: [1]
  • A more detailed tutorial: [2]


SQL:

  • A language for working with relational databases; more in a dedicated course
  • We will cover basics of SQL and work with a simple DB system SQLite3
  • SQLite3 documentation: [3]
  • SQL tutorial: [4]
  • SQLite3 in Python [5]

Program for today:

  • We introduce a simple data set
  • We look at several Python scripts for processing this data set
  • HW: You create another such script
  • We introduce basics of working directly with SQLite3
  • HW: You write your own queries
  • We look at how to combine Python and SQLite
  • HW: You write a program combining the two

Dataset for this week

  • IMDb is an online database of movies and TV series with user ratings
  • We have downloaded a preprocessed dataset of selected TV series ratings from GitHub
  • From this dataset we have selected only the 10 series with the highest average number of voting users
  • The data consist of 2 files in csv format: a list of series and a list of episodes

File series.csv contains one row per series:

  • Columns: (0) series id, (1) series title, (2) TV channel:
3,Breaking Bad,AMC
1,Game of Thrones,HBO 

File episodes.csv contains one row per episode:

  • Columns: (0) series id, (1) episode title, (2) episode order within the whole series, (3) season number, (4) episode number within season, (5) user rating, (6) the number of votes
  • Here is a sample of 4 episodes from Game of Thrones
  • If the episode title contains a comma, the whole title is in quotation marks
1,"Dark Wings, Dark Words",22,3,2,8.6,12714
1,No One,58,6,8,8.3,20709
1,Battle of the Bastards,59,6,9,9.9,138353
1,The Winds of Winter,60,6,10,9.9,93680
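
To see why the quotation marks matter, the following minimal sketch compares a plain split at commas with Python's csv module on two of the sample rows above. The variable names are illustrative only, and the rows are held in a string instead of being read from episodes.csv.

```python
import csv
import io

# two sample rows from episodes.csv, as shown above
sample = '1,"Dark Wings, Dark Words",22,3,2,8.6,12714\n1,No One,58,6,8,8.3,20709\n'

# naive splitting cuts the quoted title into two pieces,
# so the first row appears to have one column too many
naive = sample.splitlines()[0].split(",")
print(len(naive))

# csv.reader understands the quotation marks and keeps the title whole
parsed = list(csv.reader(io.StringIO(sample)))
print(len(parsed[0]), parsed[0][1])
```

This is the reason the episode scripts below use the csv library, while plain split is good enough for series.csv only as long as no title in it contains a comma.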

Several Python scripts


Print the second column (series title) from series.csv

#! /usr/bin/python3

# open a file for reading
with open('series.csv') as csvfile:
    # iterate over lines of the input file
    for line in csvfile:
        # split a line into columns at commas
        columns = line.split(",")
        # print the second column
        print(columns[1])


Print the list of series for each TV channel

  • For illustration we also separately count the series for each channel, but the count could be obtained as the length of the list
  • For simplicity we use the library data structure defaultdict instead of a plain Python dictionary
#! /usr/bin/python3
from collections import defaultdict

# Create a dictionary in which default value
# for non-existent key is 0 (type int)
# For each channel we will count the series
channel_counts = defaultdict(int)

# Create a dictionary for keeping a list of series per channel
# default value empty list
channel_lists = defaultdict(list)

# open a file and iterate over lines
with open('series.csv') as csvfile:
    for line in csvfile:
        # strip whitespace (e.g. end of line) from end of line
        line = line.rstrip()
        # split line into columns, find channel and series names
        columns = line.split(",")
        channel = columns[2]
        series = columns[1]
        # increase counter for channel
        channel_counts[channel] += 1
        # add series to list for the channel
        channel_lists[channel].append(series)

# print counts
for channel in channel_counts:
    print("The number of series for channel \"%s\" is %d" 
    % (channel, channel_counts[channel]))

# print series lists
for channel in channel_lists:
    # avoid the name "list", which would hide the built-in type
    series_list = ", ".join(channel_lists[channel])
    print("series for channel \"%s\": %s" % (channel, series_list))
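
As noted above, the separate counter is not strictly needed. A minimal sketch of getting the counts from the list lengths follows; the input rows are given directly in the code, and the third row is a made-up example added only so that one channel has two series.

```python
from collections import defaultdict

# (title, channel) pairs; the first two come from series.csv above,
# the third is a hypothetical row used only for illustration
rows = [("Breaking Bad", "AMC"), ("Game of Thrones", "HBO"), ("Example Show", "HBO")]

# build the list of series for each channel
channel_lists = defaultdict(list)
for title, channel in rows:
    channel_lists[channel].append(title)

# the count for a channel is simply the length of its list
channel_counts = {channel: len(titles) for channel, titles in channel_lists.items()}
print(channel_counts)
```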


Find the episode with the highest number of votes among all episodes

  • We use a library for csv parsing to deal with quotation marks.
#! /usr/bin/python3
import csv

# keep the maximum number of votes and its episode
max_votes = 0
max_votes_episode = None

# open a file
with open('episodes.csv') as csvfile:
    # create a reader for parsing csv files
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    # iterate over rows already split into columns
    for row in reader:
        votes = int(row[6])
        if votes > max_votes:
            max_votes = votes
            max_votes_episode = row[1]
# print result
print("Maximum votes %d in episode \"%s\"" % (max_votes, max_votes_episode))
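
The same search can also be written with Python's built-in max function and a key. This is only an alternative sketch, not the version used in the lecture; to keep it self-contained it reads three of the sample rows from a string instead of episodes.csv.

```python
import csv
import io

# three sample rows from episodes.csv shown earlier (stand-in for the real file)
sample = io.StringIO(
    '1,"Dark Wings, Dark Words",22,3,2,8.6,12714\n'
    '1,No One,58,6,8,8.3,20709\n'
    '1,Battle of the Bastards,59,6,9,9.9,138353\n'
)

rows = list(csv.reader(sample))
# pick the row whose votes column (index 6) is the largest
best = max(rows, key=lambda row: int(row[6]))
print("Maximum votes %d in episode \"%s\"" % (int(best[6]), best[1]))
```

Unlike the loop above, this version keeps all rows in memory, which is fine for a data set of this size.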


Example of function definition, reading the whole file into a 2d array

#! /usr/bin/python3
import csv

def read_csv_to_list(filename):
    # create empty list
    rows = []
    # open a file
    with open(filename) as csvfile:
        # create a reader for parsing csv files
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        # iterate over rows already split into columns
        for row in reader:
            # add the row (a list of strings) to the result
            rows.append(row)
    return rows

series = read_csv_to_list('series.csv')
episodes = read_csv_to_list('episodes.csv')
print("the number of episodes is %d" % len(episodes))
# further processing of series and episodes...
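
One possible form of such further processing is sketched below: it looks up the series title for each episode and computes the average rating per series. The two small lists stand in for the values returned by read_csv_to_list, and the names title_by_id and ratings are assumptions made for this example.

```python
from collections import defaultdict

# small stand-ins for the lists returned by read_csv_to_list;
# all values are strings, exactly as csv.reader returns them
series = [["3", "Breaking Bad", "AMC"], ["1", "Game of Thrones", "HBO"]]
episodes = [
    ["1", "No One", "58", "6", "8", "8.3", "20709"],
    ["1", "Battle of the Bastards", "59", "6", "9", "9.9", "138353"],
]

# map series id (column 0) to series title (column 1)
title_by_id = {row[0]: row[1] for row in series}

# collect the ratings (column 5) of episodes under the series title
ratings = defaultdict(list)
for row in episodes:
    ratings[title_by_id[row[0]]].append(float(row[5]))

for title, values in ratings.items():
    print("%s: average rating %.2f" % (title, sum(values) / len(values)))
```

Note that numeric columns have to be converted with int or float before any arithmetic, because the csv reader does not convert them.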

Now do HW05, task A

SQL and SQLite

Creating a database

An SQLite3 database is a file in which your data are stored in a special format. To load our csv files into an SQLite database, run the command:

sqlite3 series.db < create_db.sql

Contents of create_db.sql:

CREATE TABLE series (
  id INT,
  title TEXT,
  channel TEXT
);
.mode csv
.import series.csv series
CREATE TABLE episodes (
  seriesId INT,
  title TEXT,
  orderInSeries INT,
  season INT,
  orderInSeason INT,
  rating REAL,
  votes INT
);
.mode csv
.import episodes.csv episodes
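
For comparison, a similar import can be sketched in Python with the csv and sqlite3 modules. This is not part of create_db.sql; it is a minimal illustration that loads only the two sample rows of series.csv from a string into a temporary in-memory database.

```python
import csv
import io
import sqlite3

# stand-in for series.csv (the two sample rows shown earlier)
series_csv = io.StringIO("3,Breaking Bad,AMC\n1,Game of Thrones,HBO\n")

# ":memory:" creates a temporary database; a file name such as
# 'series.db' would store it on disk instead
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE series (id INT, title TEXT, channel TEXT)")

# insert each parsed csv row, supplying the values through placeholders
for row in csv.reader(series_csv):
    cursor.execute("INSERT INTO series VALUES (?,?,?)", row)
connection.commit()

# read the rows back to check the import
cursor.execute("SELECT id, title FROM series ORDER BY id")
loaded = cursor.fetchall()
print(loaded)
connection.close()
```

The ids arrive from the csv reader as strings, but because the column is declared INT, SQLite stores them as integers.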

SQL queries

Run sqlite3 series.db

  • Then type the following queries on the SQLite3 command line
  • The first two commands only switch on human-friendly formatting
/*  switch on human-friendly formatting */
.mode column
.headers on

/* print title of each series (as prog1.py) */
SELECT title FROM series;

/* sort titles alphabetically */
SELECT title FROM series ORDER BY title;

/* find the highest number of votes among episodes */
SELECT MAX(votes) FROM episodes;

/* find the episode with the highest number of votes, as prog3.py */
SELECT title, votes FROM episodes
  ORDER BY votes desc LIMIT 1;

/* print all episodes with more than 50k votes, order by votes */
SELECT title, votes FROM episodes
  WHERE votes>50000 ORDER BY votes desc;

/* join series and episodes tables, print 10 episodes
 * with the highest number of votes */
SELECT s.title, e.title, votes
  FROM episodes AS e, series AS s
  WHERE e.seriesId=s.id
  ORDER BY votes desc limit 10;

/* compute the number of series per channel, as prog2.py */
SELECT channel, COUNT(*) AS series_count
  FROM series GROUP BY channel;

/* print the number of episodes and average rating per season and series */
SELECT seriesId, season, COUNT(*) AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;
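
The join query above lists both tables in FROM and puts the join condition in WHERE. The same result can be obtained with the explicit JOIN ... ON syntax. The sketch below tries this on a tiny in-memory database filled with a few of the sample rows; it is driven from Python only so that it can be run without series.db.

```python
import sqlite3

# temporary in-memory database with the same table layout as series.db
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE series (id INT, title TEXT, channel TEXT)")
cursor.execute("""CREATE TABLE episodes (seriesId INT, title TEXT,
  orderInSeries INT, season INT, orderInSeason INT, rating REAL, votes INT)""")
cursor.execute("INSERT INTO series VALUES (1, 'Game of Thrones', 'HBO')")
cursor.executemany("INSERT INTO episodes VALUES (?,?,?,?,?,?,?)", [
    (1, "No One", 58, 6, 8, 8.3, 20709),
    (1, "Battle of the Bastards", 59, 6, 9, 9.9, 138353),
])

# same join as above, written with explicit JOIN ... ON
cursor.execute("""SELECT s.title, e.title, e.votes
  FROM episodes AS e JOIN series AS s ON e.seriesId = s.id
  ORDER BY e.votes DESC LIMIT 10""")
joined = cursor.fetchall()
for row in joined:
    print(row)
connection.close()
```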

Now do HW05, tasks B1, B2

Accessing database from Python


  • Script illustrates running a SELECT query and getting results
#! /usr/bin/python3
import sqlite3

# connect to a database 
connection = sqlite3.connect('series.db')
# create a "cursor" for working with the database
cursor = connection.cursor()

# run a select query
# supply parameters of the query using placeholders ?
threshold = 40000
cursor.execute("""SELECT title, votes FROM episodes
  WHERE votes>? ORDER BY votes desc""", (threshold,))

# retrieve results of the query
for row in cursor:
    print("Episode \"%s\" votes %s" % (row[0],row[1]))
# close db connection
connection.close()
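
Besides iterating over the cursor, results can be retrieved with the cursor methods fetchone and fetchall. The following sketch shows both on a simplified episodes table with only two columns; this reduced table is an assumption made to keep the example short and is not the layout of series.db.

```python
import sqlite3

# temporary in-memory database with a simplified episodes table
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE episodes (title TEXT, votes INT)")
cursor.executemany("INSERT INTO episodes VALUES (?,?)", [
    ("No One", 20709),
    ("Battle of the Bastards", 138353),
    ("The Winds of Winter", 93680),
])

threshold = 40000
# fetchone returns a single row as a tuple (or None if there is no row)
cursor.execute("SELECT COUNT(*) FROM episodes WHERE votes>?", (threshold,))
count = cursor.fetchone()[0]

# fetchall returns a list of all remaining rows
cursor.execute("SELECT title FROM episodes WHERE votes>? ORDER BY votes DESC",
               (threshold,))
titles = [row[0] for row in cursor.fetchall()]
print(count, titles)
connection.close()
```

Passing values through ? placeholders, as in the script above, lets the sqlite3 module handle quoting, which is safer than building the query string by hand.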


Script illustrates creating a new database containing a multiplication table

#! /usr/bin/python3
import sqlite3

# connect to a database 
connection = sqlite3.connect('multiplication.db')
# create a "cursor" for working with the database
cursor = connection.cursor()

# create the table; SQL statements are passed to cursor.execute as strings
cursor.execute("""
CREATE TABLE mult_table (
a INT, b INT, mult INT)
""")

for a in range(1,11):
    for b in range(1,11):
        cursor.execute("INSERT INTO mult_table (a,b,mult) VALUES (?,?,?)",
                       (a, b, a * b))

# important: save the changes
connection.commit()
# close db connection
connection.close()
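
As a variation on the nested loops, all rows can be prepared first and inserted with a single executemany call. This is a sketch of an alternative, not the lecture's version; it uses an in-memory database so that running it does not create or modify multiplication.db.

```python
import sqlite3

# temporary in-memory database instead of multiplication.db
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE mult_table (a INT, b INT, mult INT)")

# build all 100 rows first, then insert them with one executemany call
rows = [(a, b, a * b) for a in range(1, 11) for b in range(1, 11)]
cursor.executemany("INSERT INTO mult_table (a,b,mult) VALUES (?,?,?)", rows)
connection.commit()

# check the number of rows and the largest product
cursor.execute("SELECT COUNT(*), MAX(mult) FROM mult_table")
summary = cursor.fetchone()
print(summary)
connection.close()
```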

We can check the result by running the command

sqlite3 multiplication.db "SELECT * FROM mult_table;"

Now do HW05, task C