1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "Lsql"

From MAD
Jump to navigation Jump to search
 
(10 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
<!-- NOTEX -->
 
<!-- NOTEX -->
[[HWpython]]
+
[[HWsql]]
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
 
This lecture introduces the basics of the Python programming language and the basics of working with databases using the SQL language and SQLite3 lightweight database system.  
 
This lecture introduces the basics of the Python programming language and the basics of working with databases using the SQL language and SQLite3 lightweight database system.  
  
<!-- NOTEX -->
 
 
The next three lectures
 
The next three lectures
* Computer science students will use Python, SQLite3 and several advanced Python libraries for complex data processing
+
* Computer science and data science students will use Python, SQLite3 and several advanced Python libraries for complex data processing.
* Bioinformatics students will use several bioinformatics command-line tools
+
* Bioinformatics students will use several bioinformatics command-line tools.
<!-- /NOTEX -->
 
  
 
==Overview, documentation==
 
==Overview, documentation==
  
 
'''Outline of this lecture'''
 
'''Outline of this lecture'''
* We introduce a simple data set
+
* We introduce a simple data set.
* We show several Python scripts for processing this data set. <!-- NOTEX -->We will skip this part because most of you know Python. <!-- /NOTEX -->
+
* In a separate [[Python|tutorial]] for beginners in Python, we have several Python scripts for processing this data set.  
 
* We introduce basics of working with SQLite3 and writing SQL queries.
 
* We introduce basics of working with SQLite3 and writing SQL queries.
 
* We look at how to combine Python and SQLite.
 
* We look at how to combine Python and SQLite.
Line 25: Line 23:
 
* [https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf A very concise cheat sheet]
 
* [https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf A very concise cheat sheet]
 
* [https://docs.python.org/3/tutorial/ A more detailed tutorial]
 
* [https://docs.python.org/3/tutorial/ A more detailed tutorial]
 +
* [[Python|Our tutorial]]
  
 
'''SQL'''
 
'''SQL'''
 
* SQL is a language for working with relational databases. It is covered in more detail in dedicated database courses.
 
* SQL is a language for working with relational databases. It is covered in more detail in dedicated database courses.
 
* We will cover basics of SQL and work with a simple DB system SQLite3.  
 
* We will cover basics of SQL and work with a simple DB system SQLite3.  
* Advantages of storing data in a database rather than, say, csv files:
+
* Advantages of storing data in a database rather than CSV files and similar:
 
** A database can contain multiple tables and connect them through shared identifiers.
 
** A database can contain multiple tables and connect them through shared identifiers.
 
** You can create indices which allow fast access to records matching your criteria without searching through all data.
 
** You can create indices which allow fast access to records matching your criteria without searching through all data.
Line 41: Line 40:
  
 
==Dataset for this lecture==
 
==Dataset for this lecture==
* [https://www.imdb.com/ IMDb] is an online database of movies and TV series with user ratings
+
* [https://www.imdb.com/ IMDb] is an online database of movies and TV series with user ratings.
* We have downloaded a preprocessed dataset of selected TV series ratings from [https://github.com/nazareno/imdb-series/ GitHub]
+
* We have downloaded a preprocessed dataset of selected TV series ratings from [https://github.com/nazareno/imdb-series/ GitHub].
* From this dataset, we have selected 10 series with high average number of voting users
+
* From this dataset, we have selected 10 series with high average number of voting users.
* Data are two files in csv format: list of series, list of episodes
+
* Data are two files in the CSV format: list of series, list of episodes.
* csv stands for comma-separated values
+
* CSV stands for comma-separated values.
  
 
File <tt>series.cvs</tt> contains one row per series  
 
File <tt>series.cvs</tt> contains one row per series  
Line 57: Line 56:
 
File <tt>episodes.csv</tt> contains one row per episode:
 
File <tt>episodes.csv</tt> contains one row per episode:
 
* Columns: (0) series id, (1) episode title, (2) episode order within the whole series, (3) season number, (4) episode number within season, (5) user rating, (6) the number of votes
 
* Columns: (0) series id, (1) episode title, (2) episode order within the whole series, (3) season number, (4) episode number within season, (5) user rating, (6) the number of votes
* Here is a sample of 4 episodes from the Game of Thrones series
+
* Here is a sample of 4 episodes from the Game of Thrones series.
* If the episode title contains a comma, the whole title is in quotation marks
+
* If the episode title contains a comma, the whole title is in quotation marks.
 
<pre>
 
<pre>
 
1,"Dark Wings, Dark Words",22,3,2,8.6,12714
 
1,"Dark Wings, Dark Words",22,3,2,8.6,12714
Line 66: Line 65:
 
</pre>
 
</pre>
  
Note that a different version of this data was used already in the [[Lperl#The_first_input_file_for_today:_TV_series|lecture on Perl]].
+
'''Notes'''
 
+
* A different version of this data was used already in the [[Lperl#The_first_input_file_for_today:_TV_series|lecture on Perl]].
==Several Python scripts==
+
* A [[Python|Python tutorial]] shows several scripts for processing these files which are basically longer versions of some of the SQL queries below.
 
 
We will illustrate basic features of Python on several scripts working with these CSV files.
 
* In the first two examples we just use standard Python functions for reading files and split lines into columns by <tt>split</tt> command.
 
* This does not work well for episodes.csv file where comma sometimes separates columns and sometimes is in quotes within a column. Therefore we use [https://docs.python.org/3/library/csv.html csv module], which is one the the standard Python modules.
 
* Alternatively, one could import CSV files via more complex libraries, such as [https://pandas.pydata.org/ Pandas].
 
 
 
===prog1.py===
 
The first script prints the second column (series title) from <tt>series.csv</tt>
 
<syntaxhighlight lang="Python">
 
#! /usr/bin/python3
 
 
 
# open a file for reading
 
with open('series.csv') as csvfile:
 
    # iterate over lines of the input file
 
    for line in csvfile:
 
        # split a line into columns at commas
 
        columns = line.split(",")
 
        # print the second column
 
        print(columns[1])
 
</syntaxhighlight>
 
 
 
* Python uses indentation to delimit blocks. In this example, the <tt>with</tt> command starts a block and within this block the <tt>for</tt> command starts another block containing commands <tt>columns=...</tt> and <tt>print</tt>. The body of each block is indented several spaces relative to the enclosing block.
 
* Variables are not declared, but directly used. This program uses variables <tt>csvfile, line, columns</tt>.
 
* The <tt>open</tt> command opens a file (here for reading, but other options are [https://docs.python.org/3/library/functions.html#open available]).
 
* The [https://www.geeksforgeeks.org/with-statement-in-python/ <tt>with</tt> command] opens the file, stores the file handle in  <tt>csvfile</tt> variable,  executes all commands within its block and finally closes the file.
 
* The for-loop iterates over all lines in the file, assigning each in variable <tt>line</tt> and executing the body of the block.
 
* Method <tt>split</tt> of the built-in string type <tt>str</tt> splits the line at every comma and returns a list of strings, one for every column of the table (see also other [https://docs.python.org/3/library/stdtypes.html#string-methods string methods])
 
* The final line prints the second column and the end of line character.
 
 
 
===prog2.py===
 
The following script prints the list of series of each TV channel.
 
* For illustration we also separately count the number of the series for each channel, but the count could be also obtained as the length of the list.
 
<syntaxhighlight lang="Python">
 
#! /usr/bin/python3
 
from collections import defaultdict
 
 
 
# Create a dictionary in which default value
 
# for non-existent key is 0 (type int)
 
# For each channel we will count the series
 
channel_counts = defaultdict(int)
 
 
 
# Create a dictionary for keeping a list of series per channel
 
# default value empty list
 
channel_lists = defaultdict(list)
 
 
 
# open a file and iterate over lines
 
with open('series.csv') as csvfile:
 
    for line in csvfile:
 
        # strip whitespace (e.g. end of line) from end of line
 
        line = line.rstrip()
 
        # split line into columns, find channel and series names
 
        columns = line.split(",")
 
        channel = columns[2]
 
        series = columns[1]
 
        # increase counter for channel
 
        channel_counts[channel] += 1
 
        # add series to list for the channel
 
        channel_lists[channel].append(series)
 
 
 
# print counts
 
print("Counts:")
 
for (channel, count) in channel_counts.items():
 
    print(f"Channel \"{channel}\" has {count} series.")
 
 
 
# print series lists
 
print("\nLists:")
 
for channel in channel_lists:
 
    list = ", ".join(channel_lists[channel])
 
    print("Series for channel \"%s\": %s" % (channel,list))
 
</syntaxhighlight>
 
* In this script, we use two dictionaries (maps, associative arrays), both having channel names as keys. Dictionary <tt>channel_counts</tt> stores the number of series, <tt>channel_lists</tt> stores the list of series names.
 
* For simplicity we use a library data structure called <tt>defaultdict</tt> instead of a plain python dictionary. The reason is easier initialization: keys do not need  to be explicitly inserted to the dictionary, but are initialized with a default value at the first access.
 
* Reading of the input file is similar to the previous script.
 
* Afterwards we iterate through the keys of both dictionaries and print both the keys and the values.
 
* We format the output string using f-strings <tt>f"..."</tt> where expressions in <tt>{ }</tt> are evaluated and substituted to the string. Formatting similar to C-style <tt>printf</tt>, e.g. <tt>print(f"{2/3:.3f}").</tt>
 
* Notice that when we print counts, we iterate through pairs <tt>(channel, count)</tt> returned by <tt>channel_counts.items()</tt>, while when we print series, we iterate through keys of the dictionary.
 
 
 
===prog3.py===
 
This script finds the episode with the highest number of votes among all episodes.
 
* We use a library for csv parsing to deal with quotation marks around episode names with commas, such as <tt>"Dark Wings, Dark Words"</tt>.
 
* This is done by first opening a file and then passing it as an argument to <tt>csv.reader</tt>, which returns a reader object used to iterate over rows.
 
<syntaxhighlight lang="Python">
 
#! /usr/bin/python3
 
import csv
 
 
 
# keep maximum number of votes and its episode
 
max_votes = 0
 
max_votes_episode = None
 
 
 
# open a file
 
with open('episodes.csv') as csvfile:
 
    # create a reader for parsing csv files
 
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
 
    # iterate over rows already split into columns
 
    for row in reader:
 
        votes = int(row[6])
 
        if votes > max_votes:
 
            max_votes = votes
 
            max_votes_episode = row[1]
 
       
 
# print result
 
print(f"Maximum votes {max_votes} in episode \"{max_votes_episode}\"")
 
</syntaxhighlight>
 
 
 
===prog4.py===
 
The following script shows an example of function definition.
 
* The function reads a whole csv file into a 2d array.
 
* The rest of the program calls this function twice for each of the two files.
 
* This could be followed by some further processing of these 2d arrays.
 
<syntaxhighlight lang="Python">
 
#! /usr/bin/python3
 
import csv
 
 
 
def read_csv_to_list(filename):
 
    # create empty list
 
    rows = []
 
    # open a file
 
    with open(filename) as csvfile:
 
        # create a reader for parsing csv files
 
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
 
        # iterate over rows already split into columns
 
        for row in reader:
 
            rows.append(row)
 
    return rows
 
 
 
series = read_csv_to_list('series.csv')
 
episodes = read_csv_to_list('episodes.csv')
 
print(f"The number of episodes is {len(episodes)}.")
 
# further processing of series and episodes...
 
</syntaxhighlight>
 
  
 
==SQL and SQLite==
 
==SQL and SQLite==
Line 289: Line 158:
 
More complicated concepts:
 
More complicated concepts:
 
* <tt>GROUP BY</tt> allows to group rows based on common value of some columns and compute statistics per group (count, maximum, sum etc).
 
* <tt>GROUP BY</tt> allows to group rows based on common value of some columns and compute statistics per group (count, maximum, sum etc).
* If you list two tables in <tt>FROM</tt>, you will conceptually create all pairs of rows, one from one table, one from the other. These are then typically filtered in the <tt>FROM</tt> clause to only those that have a matching ID (for example <tt>WHERE e.seriesId=s.id</tt> in one of the queries above).
+
* If you list two tables in <tt>FROM</tt>, you will conceptually create all pairs of rows, one from one table, one from the other. These are then typically filtered in the <tt>WHERE</tt> clause to only those that have a matching ID (for example <tt>WHERE e.seriesId=s.id</tt> in one of the queries above).
  
 
==Accessing a database from Python==
 
==Accessing a database from Python==
Line 296: Line 165:
  
 
===read_db.py===
 
===read_db.py===
* The following script illustrates running a <tt>SELECT</tt> query and getting the results
+
* The following script illustrates running a <tt>SELECT</tt> query and getting the results.
 +
* To start using a database, we first create objects called Connection and Cursor. We use the cursor to run individual SQl queries.
 +
* If we want to use particular values in the queries, which we have stored in variables, we use placeholders "?" in the query itself and pass the values to the <tt>execute</tt> command as a tuple.
 +
** sqlite3 module ensures that the values are passed to the database correctly. If we inserted the values directly into the query string, we could face problems if the values contain e.g. quotation marks leading to SQL syntax errors or even [https://en.wikipedia.org/wiki/SQL_injection SQL injection attacks].
 +
 
 
<syntaxhighlight lang="Python">
 
<syntaxhighlight lang="Python">
 
#! /usr/bin/python3
 
#! /usr/bin/python3
Line 322: Line 195:
 
===write_db.py===
 
===write_db.py===
 
This script illustrates creating a new database containing a multiplication table.
 
This script illustrates creating a new database containing a multiplication table.
 +
* When we modify the database, it is important to run <tt>commit</tt> command to actually write them to the disk.
 
<syntaxhighlight lang="Python">
 
<syntaxhighlight lang="Python">
 
#! /usr/bin/python3
 
#! /usr/bin/python3

Latest revision as of 07:48, 14 March 2024

HWsql

This lecture introduces the basics of the Python programming language and the basics of working with databases using the SQL language and SQLite3 lightweight database system.

The next three lectures

  • Computer science and data science students will use Python, SQLite3 and several advanced Python libraries for complex data processing.
  • Bioinformatics students will use several bioinformatics command-line tools.

Overview, documentation

Outline of this lecture

  • We introduce a simple data set.
  • In a separate tutorial for beginners in Python, we have several Python scripts for processing this data set.
  • We introduce basics of working with SQLite3 and writing SQL queries.
  • We look at how to combine Python and SQLite.

Python

SQL

  • SQL is a language for working with relational databases. It is covered in more detail in dedicated database courses.
  • We will cover basics of SQL and work with a simple DB system SQLite3.
  • Advantages of storing data in a database rather than CSV files and similar:
    • A database can contain multiple tables and connect them through shared identifiers.
    • You can create indices which allow fast access to records matching your criteria without searching through all data.
    • A database system can compute complex queries without loading all data into memory.
  • Typical database systems are complex, use server-client architecture. In contrast, SQLite3 is a simple "database" stored in one file.
    • It is best used by a single process at a time, although multiple processes can read concurrently.
    • It can be easily created, does not need extensive configuration typical for more complex database systems.
  • SQLite3 documentation
  • SQL tutorial
  • SQLite3 in Python documentation

Dataset for this lecture

  • IMDb is an online database of movies and TV series with user ratings.
  • We have downloaded a preprocessed dataset of selected TV series ratings from GitHub.
  • From this dataset, we have selected 10 series with high average number of voting users.
  • Data are two files in the CSV format: list of series, list of episodes.
  • CSV stands for comma-separated values.

File series.cvs contains one row per series

  • Columns: (0) series id, (1) series title, (2) TV channel:
3,Breaking Bad,AMC
2,Sherlock,BBC
1,Game of Thrones,HBO 

File episodes.csv contains one row per episode:

  • Columns: (0) series id, (1) episode title, (2) episode order within the whole series, (3) season number, (4) episode number within season, (5) user rating, (6) the number of votes
  • Here is a sample of 4 episodes from the Game of Thrones series.
  • If the episode title contains a comma, the whole title is in quotation marks.
1,"Dark Wings, Dark Words",22,3,2,8.6,12714
1,No One,58,6,8,8.3,20709
1,Battle of the Bastards,59,6,9,9.9,138353
1,The Winds of Winter,60,6,10,9.9,93680

Notes

  • A different version of this data was used already in the lecture on Perl.
  • A Python tutorial shows several scripts for processing these files which are basically longer versions of some of the SQL queries below.

SQL and SQLite

Creating a database

SQLite3 database is a file with your data stored in a special format. To load our csv files to a SQLite database, run command:

sqlite3 series.db < create_db.sql

Contents of create_db.sql file:

CREATE TABLE series (
  id INT,
  title TEXT,
  channel TEXT
);
.mode csv
.import series.csv series
CREATE TABLE episodes (
  seriesId INT,
  title TEXT,
  orderInSeries INT,
  season INT,
  orderInSeason INT,
  rating REAL,
  votes INT
);
.mode csv
.import episodes.csv episodes
  • The two CREATE TABLE commands create two tables named series and episodes.
  • For each column (attribute) of the table we list its name and type.
  • Commands starting with a dot are special SQLite3 commands, not part of SQL itself. Command .import reads a CSV file and stores it in a table.

Other useful SQLite3 commands;

  • .schema tableName (lists columns of a given table)
  • .mode column and .headers on (turn on human-friendly formatting, not good for further processing)

SQL queries

  • Run sqlite3 series.db to get an SQLite command-line where you can interact with your database.
  • Then type the queries below which illustrate the basic features of SQL.
  • In these queries, we use uppercase for SQL keywords and lowercase for our names of tables and columns (SQL keywords are not case sensitive).
/*  switch on human-friendly formatting */
.mode column
.headers on

/* print title of each series (as prog1.py) */
SELECT title FROM series;

/* sort titles alphabetically */
SELECT title FROM series ORDER BY title;

/* find the highest vote number among episodes */
SELECT MAX(votes) FROM episodes;

/* find the episode with the highest number of votes, as prog3.py */
SELECT title, votes FROM episodes
  ORDER BY votes DESC LIMIT 1;

/* print all episodes with at least 50k votes, order by votes */
SELECT title, votes FROM episodes
  WHERE votes>50000 ORDER BY votes DESC;

/* join series and episodes tables, print 10 episodes
 * with the highest number of votes */
SELECT s.title, e.title, votes
  FROM episodes AS e, series AS s
  WHERE e.seriesId=s.id
  ORDER BY votes DESC 
  LIMIT 10;

/* compute the number of series per channel, as prog2.py */
SELECT channel, COUNT() AS series_count
  FROM series GROUP BY channel;

/* print the number of episodes and average rating per season and series */
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;

Parts of a typical SQL query:

  • SELECT followed by column names, or functions MAX, COUNT etc. These columns or expressions are printed for each row of the table, unless filtered out (see below). Individual columns of the output can be given aliases by keyword AS.
  • FROM followed by a list of tables. Tables also can get aliases (FROM episodes AS e).
  • WHERE followed by a condition used for filtering the rows.
  • ORDER BY followed by an expression used for sorting the rows.
  • LIMIT followed by the maximum number of rows to print.

More complicated concepts:

  • GROUP BY allows to group rows based on common value of some columns and compute statistics per group (count, maximum, sum etc).
  • If you list two tables in FROM, you will conceptually create all pairs of rows, one from one table, one from the other. These are then typically filtered in the WHERE clause to only those that have a matching ID (for example WHERE e.seriesId=s.id in one of the queries above).

Accessing a database from Python

We will use sqlite3 library for Python to access data from the database and to process them further in the Python program.

read_db.py

  • The following script illustrates running a SELECT query and getting the results.
  • To start using a database, we first create objects called Connection and Cursor. We use the cursor to run individual SQl queries.
  • If we want to use particular values in the queries, which we have stored in variables, we use placeholders "?" in the query itself and pass the values to the execute command as a tuple.
    • sqlite3 module ensures that the values are passed to the database correctly. If we inserted the values directly into the query string, we could face problems if the values contain e.g. quotation marks leading to SQL syntax errors or even SQL injection attacks.
#! /usr/bin/python3
import sqlite3

# connect to a database 
connection = sqlite3.connect('series.db')
# create a "cursor" for working with the database
cursor = connection.cursor()

# run a select query
# supply parameters of the query using placeholders ?
threshold = 40000
cursor.execute("""SELECT title, votes FROM episodes
  WHERE votes>? ORDER BY votes desc""", (threshold,))

# retrieve results of the query
for row in cursor:
    print(f"Episode \"{row[0]}\" votes {row[1]}")
    
# close db connection
connection.close()

write_db.py

This script illustrates creating a new database containing a multiplication table.

  • When we modify the database, it is important to run commit command to actually write them to the disk.
#! /usr/bin/python3
import sqlite3

# connect to a database 
connection = sqlite3.connect('multiplication.db')
# create a "cursor" for working with the database
cursor = connection.cursor()

cursor.execute("""
CREATE TABLE mult_table (
a INT, b INT, mult INT)
""")

for a in range(1,11):
    for b in range(1,11):
        cursor.execute("INSERT INTO mult_table (a, b, mult) VALUES (?, ?, ?)",
                       (a, b, a * b))

# important: save the changes
connection.commit()
    
# close db connection
connection.close()

We can check the result by running command

sqlite3 multiplication.db "SELECT * FROM mult_table;"