1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "Lsql"

From MAD
Jump to navigation Jump to search
(Created page with "<!-- NOTEX --> HWpython <!-- /NOTEX --> This lecture introduces the basics of the Python programming language. We will also cover basics of working with databases using t...")
 
Line 324: Line 324:
  
 
'''Now do [[#HWpython]], task C'''
 
'''Now do [[#HWpython]], task C'''
 
==HWpython==
 
 
<!-- NOTEX -->
 
See also the [[#Lpython|lecture]]
 
<!-- /NOTEX -->
 
 
===Introduction===
 
Choose one of the options:
 
* Tasks A, B1, B2, C (recommended for beginners)
 
* Tasks C, D (recommended for experienced Python/SQL programmers)
 
 
===Preparation===
 
Copy files:
 
<syntaxhighlight lang="bash">
 
mkdir python
 
cd python
 
cp -iv /tasks/python/* .
 
</syntaxhighlight>
 
 
The directory contains the following files:
 
* <tt>*.py</tt>: python scripts from the lecture, included for convenience
 
* <tt>series.csv</tt>, <tt>episodes.csv</tt>: data files introduced in the lecture
 
* <tt>create_db.sql</tt>: SQL commands to create the database needed in tasks B1, B2, C, D
 
<!-- NOTEX -->
 
* <tt>protocol.txt</tt>: fill in and submit the protocol.
 
<!-- /NOTEX -->
 
 
===Task A (Python)===
 
Write a script <tt>taskA.py</tt> which reads both csv files and outputs for each TV channel the total number of episodes in their series combined. Run your script as follows:
 
<syntaxhighlight lang="bash">
 
./taskA.py > taskA.txt
 
</syntaxhighlight>
 
One of the lines of your output should be:
 
<pre>
 
The number of episodes for channel "HBO" is 76
 
</pre>
 
<!-- NOTEX -->
 
'''Submit''' file <tt>taskA.py</tt> with your script and the output file <tt>taskA.txt</tt>:
 
<!-- /NOTEX -->
 
 
Hints:
 
* A good place to start is <tt>prog4.py</tt> with reading both csv files and <tt>prog2.py</tt> with a dictionary of counters
 
* It might be useful to build a dictionary linking the series id to the channel name for that series
 
 
===Task B1 (SQL)===
 
To prepare the database for tasks B1, B2, C and D, run the command:
 
<syntaxhighlight lang="bash">
 
sqlite3 series.db < create_db.sql
 
</syntaxhighlight>
 
 
To verify that your database was created correctly, you can run the following commands:
 
<syntaxhighlight lang="bash">
 
sqlite3 series.db ".tables"
 
# output should be  episodes  series 
 
 
sqlite3 series.db "select count() from episodes; select count() from series;"
 
# output should be 348 and 10
 
</syntaxhighlight>
 
 
The [[#Lpython#SQL_queries|last query in the lecture]] counts the number of episodes and average rating per each season of each series:
 
<syntaxhighlight lang="sql">
 
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
 
  FROM episodes GROUP BY seriesId, season;
 
</syntaxhighlight>
 
Use join with the <tt>series</tt> table to replace the numeric series id with the series title and add the channel name. Write your SQL query to file <tt>taskB1.sql</tt>. The first two lines of the file should be
 
<syntaxhighlight lang="sql">
 
.mode column
 
.headers on
 
</syntaxhighlight>
 
Run your query as follows:
 
<syntaxhighlight lang="bash">
 
sqlite3 series.db < taskB1.sql > taskB1.txt
 
</syntaxhighlight>
 
For example, both seasons of True Detective by HBO have 8 episodes and average ratings 9.3 and 8.25
 
<pre>
 
True Detective  HBO        1          8              9.3     
 
True Detective  HBO        2          8              8.25     
 
</pre>
 
<!-- NOTEX -->
 
'''Submit'''  <tt>taskB1.sql</tt> and <tt>taskB1.txt</tt>
 
<!-- /NOTEX -->
 
 
===Task B2 (SQL)===
 
For each channel compute the total count and average rating of all their episodes. Write your SQL query to file <tt>taskB2.sql</tt>. As before, the first two lines of the file should be
 
<syntaxhighlight lang="sql">
 
.mode column
 
.headers on
 
</syntaxhighlight>
 
Run your query as follows:
 
<syntaxhighlight lang="bash">
 
sqlite3 series.db < taskB2.sql > taskB2.txt
 
</syntaxhighlight>
 
For example, all 76 episodes for the two HBO series have average rating as follows:
 
<pre>
 
HBO        76          8.98947368421053
 
</pre>
 
<!-- NOTEX -->
 
'''Submit'''  <tt>taskB2.sql</tt> and <tt>taskB2.txt</tt>
 
<!-- /NOTEX -->
 
 
===Task C (Python+SQL)===
 
If you have not done so already, create an SQLite database, as explained at the beginning of [[#Task_B1 (SQL)|task B1]].
 
 
Write a Python script that runs the last query from the lecture (shown below) and stores its results in a separate table called <tt>seasons</tt> in the <tt>series.db</tt> database file
 
<syntaxhighlight lang="sql">
 
/* print the number of episodes and average rating per season and series */
 
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
 
  FROM episodes GROUP BY seriesId, season;
 
</syntaxhighlight>
 
* SQL can store the results from a query directly in a table, but in this task you should instead read each row of the <tt>SELECT</tt> query in Python and to store it by running <tt>INSERT</tt> command from Python
 
* Also do not forget to create the new table in the database with appropriate column names and types. You can execute <tt>CREATE TABLE</tt> command from Python
 
* The cursor from the <tt>SELECT</tt> query is needed while you iterate over the results. Therefore create two cursors - one for reading the database and one for writing.
 
* If you change your database during debugging, you can start over by running the command for creating the database above
 
* Store the script as <tt>taskC.py</tt>.
 
 
To check that your table was created, you can run command
 
<syntaxhighlight lang="bash">
 
sqlite3 series.db "SELECT * FROM seasons;"
 
</syntaxhighlight>
 
This will print many lines, including this one: <tt>"5|1|8|9.3"</tt> which is for season 1 of series 5 (True Detective).
 
 
<!-- NOTEX -->
 
'''Submit''' your script <tt>taskC.py</tt> and the modified database <tt>series.db</tt>.
 
<!-- /NOTEX -->
 
 
===Task D (SQL, optionally Python)===
 
For each pair of consecutive seasons within each series, compute how much has the average rating increased or decreased.
 
* For example in the Sherlock series, season 1 had rating 8.825 and season 2 rating 9.26666666666667, and thus the difference in ratings is 0.44166666666667
 
* Print a table containing series name, season number x, average rating in season x and average rating in season x+1
 
* The table should be ordered by the difference between the last two columns, i.e. from seasons with the highest increase to seasons to the highest drop.
 
* One option is to run a query in SQL in which you join the <tt>seasons</tt> table from task C with itself and select rows that belong to the same series and successive seasons
 
* You can also read the rows of the <tt>seasons</tt> table in Python, combine information from rows for successive seasons of the same series and create the final report by sorting
 
<!-- NOTEX -->
 
* '''Submit''' your code as <tt>taskD.py</tt> or <tt>taskD.sql</tt> and the resulting table as <tt>taskD.txt</tt>
 
<!-- /NOTEX -->
 
 
The output should start like this (the formatting may differ):
 
<pre>
 
series      season x    rating for x  rating for x+1 
 
----------  ----------  ------------  ----------------
 
Sherlock    1          8.825        9.26666666666667
 
Breaking B  4          9.0          9.375         
 
</pre>
 
 
When using SQL without Python, include the following two lines in <tt>taskD.sql</tt>
 
<syntaxhighlight lang="sql">
 
.mode column
 
.headers on
 
</syntaxhighlight>
 
and run your query as <tt>sqlite3 series.db < taskD.sql > taskD.txt</tt>
 

Revision as of 13:22, 9 March 2020

HWpython

This lecture introduces the basics of the Python programming language. We will also cover basics of working with databases using the SQL language and SQLite3 lightweight database system.

The next three lectures

  • Computer science students will use Python, SQLite3 and several advanced Python libraries for complex data processing
  • Bioinformatics students will use several bioinformatics command-line tools

Overview, documentation

Python

  • Popular programming language
  • Advantages: powerful language features, extensive libraries
  • Disadvantages: interpreted language, can be slow
  • A very concise cheat sheet: [1]
  • A more detailed tutorial: [2]

SQL

  • Language for working with relational databases, more in a dedicated course
  • We will cover basics of SQL and work with a simple DB system SQLite3
  • Typical database systems are complex, use server-client architecture. SQLite3 is a simple "database" stored in one file. You can think of SQLite3 not as a replacement for Oracle but as a replacement for fopen().
  • [3] SQLite3 documentation]
  • SQL tutorial
  • [SQLite3 in Python documentation

Outline of this lecture:

  • We introduce a simple data set
  • We look at several Python scripts for processing this data set
  • Solve task A, where you create another such script
  • We introduce basics of working with SQLite3 and writing SQL queries
  • Solve tasks B1 and B2, where you write your own SQL queries
  • We look at how to combine Python and SQLite
  • Solve task C, where you write a program combining the two
  • Students familiar with both Python and SQL may skip tasks A, B1, B2 and and do tasks C and D

Dataset for this lecture

  • IMDb is an online database of movies and TV series with user ratings
  • We have downloaded a preprocessed dataset of selected TV series ratings from GitHub
  • From this dataset, we have selected 10 series with high average number of voting users
  • Data are two files in csv format: list of series, list of episodes

File series.cvs contains one row per series

  • Columns: (0) series id, (1) series title, (2) TV channel:
3,Breaking Bad,AMC
2,Sherlock,BBC
1,Game of Thrones,HBO 

File episodes.csv contains one row per episode:

  • Columns: (0) series id, (1) episode title, (2) episode order within the whole series, (3) season number, (4) episode number within season, (5) user rating, (6) the number of votes
  • Here is a sample of 4 episodes from the Game of Thrones series
  • If the episode title contains a comma, the whole title is in quotation marks
1,"Dark Wings, Dark Words",22,3,2,8.6,12714
1,No One,58,6,8,8.3,20709
1,Battle of the Bastards,59,6,9,9.9,138353
1,The Winds of Winter,60,6,10,9.9,93680

Several Python scripts

We will illustrate basic features of Python on several scripts working with these files.


prog1.py

The first script prints the second column (series title) from series.csv

#! /usr/bin/python3

# open a file for reading
with open('series.csv') as csvfile:
    # iterate over lines of the input file
    for line in csvfile:
        # split a line into columns at commas
        columns = line.split(",")
        # print the second column
        print(columns[1])

prog2.py

Print the list of series of each TV channel

  • For illustration we also separately count the series for each channel, but the count could be obtained as the length of the list
  • For simplicity we use library data structure defaultdict instead of plain python dictionary
#! /usr/bin/python3
from collections import defaultdict

# Create a dictionary in which default value
# for non-existent key is 0 (type int)
# For each channel we will count the series
channel_counts = defaultdict(int)

# Create a dictionary for keeping a list of series per channel
# default value empty list
channel_lists = defaultdict(list)

# open a file and iterate over lines
with open('series.csv') as csvfile:
    for line in csvfile:
        # strip whitespace (e.g. end of line) from end of line
        line = line.rstrip()
        # split line into columns, find channel and series names
        columns = line.split(",")
        channel = columns[2]
        series = columns[1]
        # increase counter for channel
        channel_counts[channel] += 1
        # add series to list for the channel
        channel_lists[channel].append(series)

# print counts
print("Counts:")
for channel in channel_counts:
    print("The number of series for channel \"%s\" is %d" 
    % (channel, channel_counts[channel]))


# print series lists
print("\nLists:")
for channel in channel_lists:
    list = ", ".join(channel_lists[channel]) 
    print("series for channel \"%s\": %s" % (channel,list))

prog3.py

Find the episode with the highest number of votes among all episodes

  • We use a library for csv parsing to deal with quotation marks.
#! /usr/bin/python3
import csv

#keep maximum number of votes and its episode
max_votes = 0
max_votes_episode = None

# open a file
with open('episodes.csv') as csvfile:
    # create a reader for parsing csv files
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    # iterate over rows already split into columns
    for row in reader:
        votes = int(row[6])
        if votes > max_votes:
            max_votes = votes
            max_votes_episode = row[1]
        
# print result
print("Maximum votes %d in episode \"%s\"" % (max_votes, max_votes_episode))

prog4.py

Example of function definition, reading the whole file into a 2d array

#! /usr/bin/python3
import csv

def read_csv_to_list(filename):
    # create empty list
    rows = []
    # open a file
    with open(filename) as csvfile:
        # create a reader for parsing csv files
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        # iterate over rows already split into columns
        for row in reader:
            rows.append(row)
    return rows

series = read_csv_to_list('series.csv')
episodes = read_csv_to_list('episodes.csv')
print("the number of episodes is %d" % len(episodes))
# further processing of series and episodes...

Now do #HWpython, task A

SQL and SQLite

Creating a database

SQLite3 database is a file with your data stored in some special format. To load our csv file to a SQLite database, run command:

sqlite3 series.db < create_db.sql

Contents of create_db.sql file:

CREATE TABLE series (
  id INT,
  title TEXT,
  channel TEXT
);
.mode csv
.import series.csv series
CREATE TABLE episodes (
  seriesId INT,
  title TEXT,
  orderInSeries INT,
  season INT,
  orderInSeason INT,
  rating REAL,
  votes INT
);
.mode csv
.import episodes.csv episodes
  • The two CREATE TABLE commnds create two tables naamed series and episodes
  • For each column (attribute) of the table we list if name and type.
  • Commands starting with a dot are special SQLite3 commands, not part of SQL itself. Command .import reads the txt file and stores it in a table.

Other useful SQLite3 commands;

  • .schema tableName (lists columns of a given table)
  • .mode column and .headers on (turn on human-friendly formatting (not good for further processing)

SQL queries

  • Run sqlite3 series.db to get an SQLite command-line where you can interact with your database
  • Then type the following queries which illustrate the basic features of SQL
/*  switch on human-friendly formatting */
.mode column
.headers on

/* print title of each series (as prog1.py) */
SELECT title FROM series;

/* sort titles alphabetically */
SELECT title FROM series ORDER BY title;

/* find the highest vote number among episodes */
SELECT MAX(votes) FROM episodes;

/* find the episode with the highest number of votes, as prog3.py */
SELECT title, votes FROM episodes
  ORDER BY votes DESC LIMIT 1;

/* print all episodes with at least 50k votes, order by votes */
SELECT title, votes FROM episodes
  WHERE votes>50000 ORDER BY votes desc;

/* join series and episodes tables, print 10 episodes
 * with the highest number of votes */
SELECT s.title, e.title, votes
  FROM episodes AS e, series AS s
  WHERE e.seriesId=s.id
  ORDER BY votes desc limit 10;

/* compute the number of series per channel, as prog2.py */
SELECT channel, COUNT() as series_count
  FROM series GROUP BY channel;

/* print the number of episodes and average rating per season and series */
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;

Now do #HWpython, tasks B1, B2

Accessing a database from Python

We will use sqlite3 library for Python to access data from the database and to process in further in the Python program.

read_db.py

  • The following script illustrates running a SELECT query and the getting results
#! /usr/bin/python3
import sqlite3

# connect to a database 
connection = sqlite3.connect('series.db')
# create a "cursor" for working with the database
cursor = connection.cursor()

# run a select query
# supply parameters of the query using placeholders ?
threshold = 40000
cursor.execute("""SELECT title, votes FROM episodes
  WHERE votes>? ORDER BY votes desc""", (threshold,))

# retrieve results of the query
for row in cursor:
    print("Episode \"%s\" votes %s" % (row[0],row[1]))
    
# close db connection
connection.close()

write_db.py

This script illustrates creating a new database containing a multiplication table

#! /usr/bin/python3
import sqlite3

# connect to a database 
connection = sqlite3.connect('multiplication.db')
# create a "cursor" for working with the database
cursor = connection.cursor()

cursor.execute("""
CREATE TABLE mult_table (
a INT, b INT, mult INT)
""")

for a in range(1,11):
    for b in range(1,11):
        cursor.execute("INSERT INTO mult_table (a,b,mult) VALUES (?,?,?)",
                       (a,b,a*b))

# important: save the changes
connection.commit()
    
# close db connection
connection.close()

We can check the result by running command

sqlite3 multiplication.db "SELECT * FROM mult_table;"

Now do #HWpython, task C