1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


Difference between revisions of "HWsql"

From MAD
Jump to navigation Jump to search
 
(15 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
<!-- NOTEX -->
 
<!-- NOTEX -->
See also the [[Lpython|lecture]]
+
See also the [[Lsql|lecture]] and [[Python|Python tutorial]].
 
<!-- /NOTEX -->
 
<!-- /NOTEX -->
  
 
===Introduction===
 
===Introduction===
Choose one of the options:
 
* Tasks A, B1, B2, C (recommended for beginners)
 
* Tasks C, D (recommended for experienced Python/SQL programmers)
 
  
===Preparation===
+
* Task A: introduction to Python
Copy files:
+
* Tasks B1, B2: introduction to SQL
 +
* Task C: Python+SQL
 +
* Task D: SQL and optionally also Python
 +
 
 +
<!-- NOTEX -->'''If you are a beginner in Python (not DAV and BIN students!), do tasks A,B1,B2,C. Otherwise do tasks B1,B2,C,D.'''<!-- /NOTEX -->
 +
 
 +
===Preparation and submitting===
 +
Before you start working, copy files to your home folder:
 
<syntaxhighlight lang="bash">
 
<syntaxhighlight lang="bash">
mkdir python
+
mkdir sql
cd python
+
cd sql
cp -iv /tasks/python/* .
+
cp -iv /tasks/sql/* .
 
</syntaxhighlight>
 
</syntaxhighlight>
  
The directory contains the following files:
+
The folder contains the following files:
* <tt>*.py</tt>: python scripts from the lecture, included for convenience
+
* <tt>*.py</tt>: scripts from the Python tutorial, included for convenience
 
* <tt>series.csv</tt>, <tt>episodes.csv</tt>: data files introduced in the lecture
 
* <tt>series.csv</tt>, <tt>episodes.csv</tt>: data files introduced in the lecture
 
* <tt>create_db.sql</tt>: SQL commands to create the database needed in tasks B1, B2, C, D
 
* <tt>create_db.sql</tt>: SQL commands to create the database needed in tasks B1, B2, C, D
<!-- NOTEX -->
 
 
* <tt>protocol.txt</tt>: fill in and submit the protocol.  
 
* <tt>protocol.txt</tt>: fill in and submit the protocol.  
<!-- /NOTEX -->
+
 
 +
Submit by copying requested files to <tt>/submit/sql/username/</tt> as follows (use the version for beginners or non-beginners):
 +
 
 +
<syntaxhighlight lang="bash">
 +
# for beginners in Python:
 +
cp -ipv protocol.txt taskA.py taskA.txt taskB1.sql taskB1.txt taskB2.sql taskB2.txt taskC.py series.db /submit/sql/username/
 +
 
 +
# for non-beginners in Python:
 +
cp -ipv protocol.txt taskB1.sql taskB1.txt taskB2.sql taskB2.txt taskC.py series.db taskD.py taskD.sql taskD.txt /submit/sql/username/
 +
# one of taskD.py and taskD.sql will be missing
 +
# this is ok
 +
</syntaxhighlight>
  
 
===Task A (Python)===
 
===Task A (Python)===
Line 33: Line 47:
 
The number of episodes for channel "HBO" is 76
 
The number of episodes for channel "HBO" is 76
 
</pre>
 
</pre>
<!-- NOTEX -->
 
 
'''Submit''' file <tt>taskA.py</tt> with your script and the output file <tt>taskA.txt</tt>:
 
'''Submit''' file <tt>taskA.py</tt> with your script and the output file <tt>taskA.txt</tt>:
<!-- /NOTEX -->
 
  
 
Hints:
 
Hints:
* A good place to start is <tt>prog4.py</tt> with reading both csv files and <tt>prog2.py</tt> with a dictionary of counters
+
* A good place to start is <tt>prog4.py</tt> with reading both csv files and <tt>prog2.py</tt> with a dictionary of counters.
* It might be useful to build a dictionary linking the series id to the channel name for that series
+
* It might be useful to build a dictionary linking the series id to the channel name for that series.
  
 
===Task B1 (SQL)===
 
===Task B1 (SQL)===
To prepare the database for tasks B1, B2, C and D, run the command:
+
To '''prepare the database''' for tasks B1, B2, C and D, run the command:
 
<syntaxhighlight lang="bash">
 
<syntaxhighlight lang="bash">
 
sqlite3 series.db < create_db.sql
 
sqlite3 series.db < create_db.sql
Line 56: Line 68:
 
</syntaxhighlight>
 
</syntaxhighlight>
  
The [[Lpython#SQL_queries|last query in the lecture]] counts the number of episodes and average rating per each season of each series:
+
The '''task''' here is to extend the [[Lsql#SQL_queries|last query in the lecture]], which counts the number of episodes and average rating per each season of each series:
 
<syntaxhighlight lang="sql">
 
<syntaxhighlight lang="sql">
 
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
 
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
Line 75: Line 87:
 
True Detective  HBO        2          8              8.25       
 
True Detective  HBO        2          8              8.25       
 
</pre>
 
</pre>
<!-- NOTEX -->
 
 
'''Submit'''  <tt>taskB1.sql</tt> and <tt>taskB1.txt</tt>
 
'''Submit'''  <tt>taskB1.sql</tt> and <tt>taskB1.txt</tt>
<!-- /NOTEX -->
 
  
 
===Task B2 (SQL)===
 
===Task B2 (SQL)===
For each channel compute the total count and average rating of all their episodes. Write your SQL query to file <tt>taskB2.sql</tt>. As before, the first two lines of the file should be
+
For each channel, compute the total count and average rating of all their episodes. Write your SQL query to file <tt>taskB2.sql</tt>. As before, the first two lines of the file should be
 
<syntaxhighlight lang="sql">
 
<syntaxhighlight lang="sql">
 
.mode column
 
.mode column
Line 93: Line 103:
 
HBO        76          8.98947368421053
 
HBO        76          8.98947368421053
 
</pre>
 
</pre>
<!-- NOTEX -->
 
 
'''Submit'''  <tt>taskB2.sql</tt> and <tt>taskB2.txt</tt>
 
'''Submit'''  <tt>taskB2.sql</tt> and <tt>taskB2.txt</tt>
<!-- /NOTEX -->
 
  
 
===Task C (Python+SQL)===
 
===Task C (Python+SQL)===
 
If you have not done so already, create an SQLite database, as explained at the beginning of [[#Task_B1 (SQL)|task B1]].
 
If you have not done so already, create an SQLite database, as explained at the beginning of [[#Task_B1 (SQL)|task B1]].
  
Write a Python script that runs the last query from the lecture (shown below) and stores its results in a separate table called <tt>seasons</tt> in the <tt>series.db</tt> database file
+
Write a Python script that runs the last query from the lecture (shown below) and stores its results in a separate table called <tt>seasons</tt> in the <tt>series.db</tt> database file.
 
<syntaxhighlight lang="sql">
 
<syntaxhighlight lang="sql">
 
/* print the number of episodes and average rating per season and series */
 
/* print the number of episodes and average rating per season and series */
Line 106: Line 114:
 
   FROM episodes GROUP BY seriesId, season;
 
   FROM episodes GROUP BY seriesId, season;
 
</syntaxhighlight>
 
</syntaxhighlight>
* SQL can store the results from a query directly in a table, but in this task you should instead read each row of the <tt>SELECT</tt> query in Python and to store it by running <tt>INSERT</tt> command from Python
+
* SQL can store the results from a query directly in a table, but in this task you should instead read each row of the <tt>SELECT</tt> query in Python and to store it by running <tt>INSERT</tt> command from Python.
* Also do not forget to create the new table in the database with appropriate column names and types. You can execute <tt>CREATE TABLE</tt> command from Python
+
* At the start of your script, execute SQL command <tt>DROP TABLE  IF EXISTS seasons</tt>, which will erase attempts from previous runs of your script. Afterwards execute SQL <tt>CREATE TABLE</tt> command for creating <tt>seasons</tt> table with appropriate column names and types.
* The cursor from the <tt>SELECT</tt> query is needed while you iterate over the results. Therefore create two cursors - one for reading the database and one for writing.
+
* The cursor from the <tt>SELECT</tt> query is needed while you iterate over the results. Therefore create two cursors: one for reading the database and one for writing.
* If you change your database during debugging, you can start over by running the command for creating the database above
+
** The SQLite3 library in Python has some [https://docs.python.org/3/library/sqlite3.html#transaction-control implicit transactions], so you may end up with some locking errors. We recommend the following order of commands: (1) create connection and two cursors, (2) run <tt>DROP TABLE</tt> and <tt>CREATE TABLE</tt> commands, commit changes, (3) run <tt>SELECT</tt> command on one of the cursors, (4) as you iterate through results of <tt>SELECT</tt>, run <tt>INSERT</tt> on the second cursor, (4) commit changes.  You could also commit after each <tt>INSERT</tt>.  
 +
* If you damage your database during debugging, you can start over by again running the command for creating the database as in [[#Task_B1 (SQL)|task B1]].
 
* Store the script as <tt>taskC.py</tt>.  
 
* Store the script as <tt>taskC.py</tt>.  
  
Line 118: Line 127:
 
This will print many lines, including this one: <tt>"5|1|8|9.3"</tt> which is for season 1 of series 5 (True Detective).
 
This will print many lines, including this one: <tt>"5|1|8|9.3"</tt> which is for season 1 of series 5 (True Detective).
  
<!-- NOTEX -->
 
 
'''Submit''' your script <tt>taskC.py</tt> and the modified database <tt>series.db</tt>.
 
'''Submit''' your script <tt>taskC.py</tt> and the modified database <tt>series.db</tt>.
<!-- /NOTEX -->
 
  
 
===Task D (SQL, optionally Python)===
 
===Task D (SQL, optionally Python)===
Line 126: Line 133:
 
* For example in the Sherlock series, season 1 had rating 8.825 and season 2 rating 9.26666666666667, and thus the difference in ratings is 0.44166666666667
 
* For example in the Sherlock series, season 1 had rating 8.825 and season 2 rating 9.26666666666667, and thus the difference in ratings is 0.44166666666667
 
* Print a table containing series name, season number x, average rating in season x and average rating in season x+1
 
* Print a table containing series name, season number x, average rating in season x and average rating in season x+1
* The table should be ordered by the difference between the last two columns, i.e. from seasons with the highest increase to seasons to the highest drop.
+
* The table should be ordered by the difference between the last two columns, i.e. from seasons with the highest increase to seasons with the highest drop.
* One option is to run a query in SQL in which you join the <tt>seasons</tt> table from task C with itself and select rows that belong to the same series and successive seasons
+
* One option is to run a query in SQL in which you join the <tt>seasons</tt> table from task C with itself and select rows that belong to the same series and successive seasons.
* You can also read the rows of the <tt>seasons</tt> table in Python, combine information from rows for successive seasons of the same series and create the final report by sorting
+
* You can also read the rows of the <tt>seasons</tt> table in Python, combine information from rows for successive seasons of the same series and create the final report by sorting.
<!-- NOTEX -->
 
 
* '''Submit''' your code as <tt>taskD.py</tt> or <tt>taskD.sql</tt> and the resulting table as <tt>taskD.txt</tt>
 
* '''Submit''' your code as <tt>taskD.py</tt> or <tt>taskD.sql</tt> and the resulting table as <tt>taskD.txt</tt>
<!-- /NOTEX -->
 
  
 
The output should start like this (the formatting may differ):
 
The output should start like this (the formatting may differ):

Latest revision as of 08:42, 14 March 2024

See also the lecture and Python tutorial.

Introduction

  • Task A: introduction to Python
  • Tasks B1, B2: introduction to SQL
  • Task C: Python+SQL
  • Task D: SQL and optionally also Python

If you are a beginner in Python (not DAV and BIN students!), do tasks A,B1,B2,C. Otherwise do tasks B1,B2,C,D.

Preparation and submitting

Before you start working, copy files to your home folder:

mkdir sql
cd sql
cp -iv /tasks/sql/* .

The folder contains the following files:

  • *.py: scripts from the Python tutorial, included for convenience
  • series.csv, episodes.csv: data files introduced in the lecture
  • create_db.sql: SQL commands to create the database needed in tasks B1, B2, C, D
  • protocol.txt: fill in and submit the protocol.

Submit by copying requested files to /submit/sql/username/ as follows (use the version for beginners or non-beginners):

# for beginners in Python:
cp -ipv protocol.txt taskA.py taskA.txt taskB1.sql taskB1.txt taskB2.sql taskB2.txt taskC.py series.db /submit/sql/username/

# for non-beginners in Python:
cp -ipv protocol.txt taskB1.sql taskB1.txt taskB2.sql taskB2.txt taskC.py series.db taskD.py taskD.sql taskD.txt /submit/sql/username/
# one of taskD.py and taskD.sql will be missing
# this is ok

Task A (Python)

Write a script taskA.py which reads both csv files and outputs for each TV channel the total number of episodes in their series combined. Run your script as follows:

./taskA.py > taskA.txt

One of the lines of your output should be:

The number of episodes for channel "HBO" is 76

Submit file taskA.py with your script and the output file taskA.txt:

Hints:

  • A good place to start is prog4.py with reading both csv files and prog2.py with a dictionary of counters.
  • It might be useful to build a dictionary linking the series id to the channel name for that series.

Task B1 (SQL)

To prepare the database for tasks B1, B2, C and D, run the command:

sqlite3 series.db < create_db.sql

To verify that your database was created correctly, you can run the following commands:

sqlite3 series.db ".tables"
# output should be  episodes  series  

sqlite3 series.db "select count() from episodes; select count() from series;"
# output should be 348 and 10

The task here is to extend the last query in the lecture, which counts the number of episodes and average rating per each season of each series:

SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;

Use join with the series table to replace the numeric series id with the series title and add the channel name. Write your SQL query to file taskB1.sql. The first two lines of the file should be

.mode column
.headers on

Run your query as follows:

sqlite3 series.db < taskB1.sql > taskB1.txt

For example, both seasons of True Detective by HBO have 8 episodes and average ratings 9.3 and 8.25

True Detective   HBO         1           8              9.3       
True Detective   HBO         2           8              8.25      

Submit taskB1.sql and taskB1.txt

Task B2 (SQL)

For each channel, compute the total count and average rating of all their episodes. Write your SQL query to file taskB2.sql. As before, the first two lines of the file should be

.mode column
.headers on

Run your query as follows:

sqlite3 series.db < taskB2.sql > taskB2.txt

For example, all 76 episodes for the two HBO series have average rating as follows:

HBO         76          8.98947368421053

Submit taskB2.sql and taskB2.txt

Task C (Python+SQL)

If you have not done so already, create an SQLite database, as explained at the beginning of task B1.

Write a Python script that runs the last query from the lecture (shown below) and stores its results in a separate table called seasons in the series.db database file.

/* print the number of episodes and average rating per season and series */
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;
  • SQL can store the results from a query directly in a table, but in this task you should instead read each row of the SELECT query in Python and to store it by running INSERT command from Python.
  • At the start of your script, execute SQL command DROP TABLE IF EXISTS seasons, which will erase attempts from previous runs of your script. Afterwards execute SQL CREATE TABLE command for creating seasons table with appropriate column names and types.
  • The cursor from the SELECT query is needed while you iterate over the results. Therefore create two cursors: one for reading the database and one for writing.
    • The SQLite3 library in Python has some implicit transactions, so you may end up with some locking errors. We recommend the following order of commands: (1) create connection and two cursors, (2) run DROP TABLE and CREATE TABLE commands, commit changes, (3) run SELECT command on one of the cursors, (4) as you iterate through results of SELECT, run INSERT on the second cursor, (4) commit changes. You could also commit after each INSERT.
  • If you damage your database during debugging, you can start over by again running the command for creating the database as in task B1.
  • Store the script as taskC.py.

To check that your table was created, you can run command

sqlite3 series.db "SELECT * FROM seasons;"

This will print many lines, including this one: "5|1|8|9.3" which is for season 1 of series 5 (True Detective).

Submit your script taskC.py and the modified database series.db.

Task D (SQL, optionally Python)

For each pair of consecutive seasons within each series, compute how much has the average rating increased or decreased.

  • For example in the Sherlock series, season 1 had rating 8.825 and season 2 rating 9.26666666666667, and thus the difference in ratings is 0.44166666666667
  • Print a table containing series name, season number x, average rating in season x and average rating in season x+1
  • The table should be ordered by the difference between the last two columns, i.e. from seasons with the highest increase to seasons with the highest drop.
  • One option is to run a query in SQL in which you join the seasons table from task C with itself and select rows that belong to the same series and successive seasons.
  • You can also read the rows of the seasons table in Python, combine information from rows for successive seasons of the same series and create the final report by sorting.
  • Submit your code as taskD.py or taskD.sql and the resulting table as taskD.txt

The output should start like this (the formatting may differ):

series      season x    rating for x  rating for x+1  
----------  ----------  ------------  ----------------
Sherlock    1           8.825         9.26666666666667
Breaking B  4           9.0           9.375           

When using SQL without Python, include the following two lines in taskD.sql

.mode column
.headers on

and run your query as sqlite3 series.db < taskD.sql > taskD.txt