2-INF-185 Integrácia dátových zdrojov 2017/18

Materiály · Úvod · Pravidlá · Kontakt
Body z HW01 a HW04 nájdete na serveri v /grades/userid.txt


HW05

From IDZ
Jump to: navigation, search

Lecture 05

Preparation

Copy files:

mkdir hw05
cd hw05
cp -iv /tasks/hw05/* .

The directory contains the following files:

  • *.py: python scripts for the lecture, included only for convenience
  • series.csv, episodes.csv: data file used in the homework (and the lecture)
  • create_db.sql: sql commands to create the database needed in tasks B1, B2, C
  • protocol.txt: fill in and submit the protocol. Only "Vyhodnotenie" and "Pouzite zdroje" are needed this time

To prepare the database for tasks B1, B2 and C, run the command:

sqlite3 series.db < create_db.sql

To verify that your database was created correctly, you can run the following commands:

sqlite3 series.db ".tables"
# output should be  episodes  series  

sqlite3 series.db "select count() from episodes; select count() from series;"
# output should be 348 and 10

Task A

  • Write a script which reads both csv files and outputs for each TV channel the total number of episodes in their series combined
  • Submit file taskA.py with your script
  • Run your script as follows and submit the file taskA.txt:
./taskA.py > taskA.txt
  • One of the lines of your output should be:
The number of episodes for channel "HBO" is 76

Hints:

  • A good place to start is prog4.py with reading both csv files and prog3.py with dictionary of counters
  • It might be useful to build a dictionary linking series id to the channel name for that series

Task B1

  • Prepare your database as shown above
  • The last query in the lecture counts the number of episodes and average rating per each season of each series
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;
  • Use join with series table to replace numeric series id with series title and add the channel name
  • Write your SQL query to file taskB1.sql and submit this file
    • The first two lines of the sql file should be
.mode column
.headers on
  • Run your query as follows:
sqlite3 series.db < taskB1.sql > taskB1.txt
  • Submit also the resulting file taskB1.txt
  • For example, both seasons of True Detective by HBO have 8 episodes and average ratings 9.3 and 8.25
True Detective   HBO         1           8              9.3       
True Detective   HBO         2           8              8.25      

Task B2

  • For each channel compute the total count and average rating of all their episodes.
  • Write your SQL query to file taskB2.sql and submit this file
    • The first two lines of the sql file should be
.mode column
.headers on
  • Run your query as follows:
sqlite3 series.db taskB2.sql > taskB2.txt
  • Submit also the resulting file taskB2.txt
  • For example, all 76 episodes for the two HBO series have average rating as follows:
HBO         76          8.98947368421053

Task C

  • Write a python script that runs the last query from the lecture (shown below) and stores its results in a separate table called seasons in the series.db database
/* print the number of episodes and average rating per season and series */
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;
  • SQL can store results from a query directly in a table, but in this task you should instead read each row of the SELECT query in python and to store it by running INSERT command from python
  • Also do not forget to create the new table in the database with appropriate column names and types. You can execute CREATE TABLE command from python
  • The cursor from the SELECT query is needed while you iterate over the results. Therefore create two cursors - one for reading the database and one for writing.
  • If you change you database during debugging, you can start over by running the command for creating the database above
  • Store and submit the script in taskC.py. Also submit the modified database series.db

Further possibilities

  • If you want to practice Python and SQL some more, you can try this task. Do not submit it.
  • Find all series in which there was a drop in ratings from one season to the next more than 0.5
    • For example in task B1, we have seen drop of 9.3-8.25=1.05 in the True Detective series
  • Analogously you could find series with big increases in the successive seasons
  • One option is to run a query in SQL in which you join table seasons from task C with itself and select rows that belong to the same series and successive seasons
  • Another option is to iterate over all rows of seasons table in Python and to find the answer by comparing rows for successive seasons of the same series