1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Difference between revisions of "HWsql"
Line 22: | Line 22: | ||
<!-- NOTEX --> | <!-- NOTEX --> | ||
* <tt>protocol.txt</tt>: fill in and submit the protocol. | * <tt>protocol.txt</tt>: fill in and submit the protocol. | ||
+ | <!-- /NOTEX --> | ||
+ | |||
+ | <!-- NOTEX --> | ||
+ | Submit by copying requested files to <tt>/submit/python/username/</tt> | ||
<!-- /NOTEX --> | <!-- /NOTEX --> | ||
Revision as of 16:10, 12 March 2020
See also the lecture
Contents
Introduction
Choose one of the options:
- Tasks A, B1, B2, C (recommended for beginners)
- Tasks C, D (recommended for experienced Python/SQL programmers)
Preparation
Copy files:
mkdir python
cd python
cp -iv /tasks/python/* .
The directory contains the following files:
- *.py: python scripts from the lecture, included for convenience
- series.csv, episodes.csv: data files introduced in the lecture
- create_db.sql: SQL commands to create the database needed in tasks B1, B2, C, D
- protocol.txt: fill in and submit the protocol.
Submit by copying requested files to /submit/python/username/
Task A (Python)
Write a script taskA.py which reads both csv files and outputs for each TV channel the total number of episodes in their series combined. Run your script as follows:
./taskA.py > taskA.txt
One of the lines of your output should be:
The number of episodes for channel "HBO" is 76
Submit file taskA.py with your script and the output file taskA.txt:
Hints:
- A good place to start is prog4.py with reading both csv files and prog2.py with a dictionary of counters
- It might be useful to build a dictionary linking the series id to the channel name for that series
Task B1 (SQL)
To prepare the database for tasks B1, B2, C and D, run the command:
sqlite3 series.db < create_db.sql
To verify that your database was created correctly, you can run the following commands:
sqlite3 series.db ".tables"
# output should be episodes series
sqlite3 series.db "select count() from episodes; select count() from series;"
# output should be 348 and 10
The last query in the lecture counts the number of episodes and average rating per each season of each series:
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
FROM episodes GROUP BY seriesId, season;
Use join with the series table to replace the numeric series id with the series title and add the channel name. Write your SQL query to file taskB1.sql. The first two lines of the file should be
.mode column
.headers on
Run your query as follows:
sqlite3 series.db < taskB1.sql > taskB1.txt
For example, both seasons of True Detective by HBO have 8 episodes and average ratings 9.3 and 8.25
True Detective HBO 1 8 9.3 True Detective HBO 2 8 8.25
Submit taskB1.sql and taskB1.txt
Task B2 (SQL)
For each channel compute the total count and average rating of all their episodes. Write your SQL query to file taskB2.sql. As before, the first two lines of the file should be
.mode column
.headers on
Run your query as follows:
sqlite3 series.db < taskB2.sql > taskB2.txt
For example, all 76 episodes for the two HBO series have average rating as follows:
HBO 76 8.98947368421053
Submit taskB2.sql and taskB2.txt
Task C (Python+SQL)
If you have not done so already, create an SQLite database, as explained at the beginning of task B1.
Write a Python script that runs the last query from the lecture (shown below) and stores its results in a separate table called seasons in the series.db database file
/* print the number of episodes and average rating per season and series */
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
FROM episodes GROUP BY seriesId, season;
- SQL can store the results from a query directly in a table, but in this task you should instead read each row of the SELECT query in Python and to store it by running INSERT command from Python
- Also do not forget to create the new table in the database with appropriate column names and types. You can execute CREATE TABLE command from Python
- The cursor from the SELECT query is needed while you iterate over the results. Therefore create two cursors - one for reading the database and one for writing.
- If you change your database during debugging, you can start over by running the command for creating the database above
- Store the script as taskC.py.
To check that your table was created, you can run command
sqlite3 series.db "SELECT * FROM seasons;"
This will print many lines, including this one: "5|1|8|9.3" which is for season 1 of series 5 (True Detective).
Submit your script taskC.py and the modified database series.db.
Task D (SQL, optionally Python)
For each pair of consecutive seasons within each series, compute how much has the average rating increased or decreased.
- For example in the Sherlock series, season 1 had rating 8.825 and season 2 rating 9.26666666666667, and thus the difference in ratings is 0.44166666666667
- Print a table containing series name, season number x, average rating in season x and average rating in season x+1
- The table should be ordered by the difference between the last two columns, i.e. from seasons with the highest increase to seasons with the highest drop.
- One option is to run a query in SQL in which you join the seasons table from task C with itself and select rows that belong to the same series and successive seasons
- You can also read the rows of the seasons table in Python, combine information from rows for successive seasons of the same series and create the final report by sorting
- Submit your code as taskD.py or taskD.sql and the resulting table as taskD.txt
The output should start like this (the formatting may differ):
series season x rating for x rating for x+1 ---------- ---------- ------------ ---------------- Sherlock 1 8.825 9.26666666666667 Breaking B 4 9.0 9.375
When using SQL without Python, include the following two lines in taskD.sql
.mode column
.headers on
and run your query as sqlite3 series.db < taskD.sql > taskD.txt