2-INF-185 Integrácia dátových zdrojov 2018/19

Materiály · Úvod · Pravidlá · Kontakt
· Od 14.3. sa presúvame do učebne F2-T3, a.k.a. F2-128.
· Body z už opravených úloh nájdete na serveri v /grades/userid.txt
· Do stredy 17.4. odovzdajte návrh projektu vo formáte .txt alebo .pdf do adresára /submit/navrh/username
  (príklady projektov pre bioinformatikov)


HW04

From IDZ
Jump to: navigation, search

Lecture 04

Introduction

Choose one of the options:

  • Tasks A, B1, B2, C (recommended for beginners)
  • Tasks C, D (recommended for experienced Python/SQL programmers)

Preparation

Copy files:

mkdir hw04
cd hw04
cp -iv /tasks/hw04/* .

The directory contains the following files:

  • *.py: python scripts from the lecture, included only for convenience
  • series.csv, episodes.csv: data file used in the homework (and the lecture)
  • create_db.sql: sql commands to create the database needed in tasks B1, B2, C, D
  • protocol.txt: fill in and submit the protocol. Only "Vyhodnotenie" and "Pouzite zdroje" are needed this time

Task A

  • Write a script which reads both csv files and outputs for each TV channel the total number of episodes in their series combined
  • Submit file taskA.py with your script
  • Run your script as follows and submit the file taskA.txt:
./taskA.py > taskA.txt
  • One of the lines of your output should be:
The number of episodes for channel "HBO" is 76

Hints:

  • A good place to start is prog4.py with reading both csv files and prog2.py with a dictionary of counters
  • It might be useful to build a dictionary linking the series id to the channel name for that series

Task B1

  • To prepare the database for tasks B1, B2 and C, run the command:
sqlite3 series.db < create_db.sql

To verify that your database was created correctly, you can run the following commands:

sqlite3 series.db ".tables"
# output should be  episodes  series  

sqlite3 series.db "select count() from episodes; select count() from series;"
# output should be 348 and 10
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;
  • Use join with the series table to replace the numeric series id with the series title and add the channel name
  • Write your SQL query to file taskB1.sql and submit this file
    • The first two lines of the sql file should be
.mode column
.headers on
  • Run your query as follows:
sqlite3 series.db < taskB1.sql > taskB1.txt
  • Submit also the resulting file taskB1.txt
  • For example, both seasons of True Detective by HBO have 8 episodes and average ratings 9.3 and 8.25
True Detective   HBO         1           8              9.3       
True Detective   HBO         2           8              8.25      

Task B2

  • For each channel compute the total count and average rating of all their episodes.
  • Write your SQL query to file taskB2.sql and submit this file
    • The first two lines of the sql file should be
.mode column
.headers on
  • Run your query as follows:
sqlite3 series.db < taskB2.sql > taskB2.txt
  • Submit also the resulting file taskB2.txt
  • For example, all 76 episodes for the two HBO series have average rating as follows:
HBO         76          8.98947368421053

Task C

  • If you have not done so already, create an SQLite database, as explained at the beginning of task B1.
  • Write a python script that runs the last query from the lecture (shown below) and stores its results in a separate table called seasons in the series.db database
/* print the number of episodes and average rating per season and series */
SELECT seriesId, season, COUNT() AS episode_count, AVG(rating) AS rating
  FROM episodes GROUP BY seriesId, season;
  • SQL can store results from a query directly in a table, but in this task you should instead read each row of the SELECT query in python and to store it by running INSERT command from python
  • Also do not forget to create the new table in the database with appropriate column names and types. You can execute CREATE TABLE command from python
  • The cursor from the SELECT query is needed while you iterate over the results. Therefore create two cursors - one for reading the database and one for writing.
  • If you change your database during debugging, you can start over by running the command for creating the database above
  • Store and submit the script in taskC.py. Also submit the modified database series.db
  • To check that your table was created, you can run command
sqlite3 series.db "SELECT * FROM seasons;"
  • This will print many lines, including this one: "5|1|8|9.3" which is for season 1 of series 5 (True Detective)

Task D

  • For each pair of consecutive seasons within each series, compute how much has the average rating increased or decreased
  • For example in the Sherlock series, season 1 had rating 8.825 and season 2 rating 9.26666666666667, and thus the difference in ratings is -0.44166666666667
  • Print a table containing series name, season number x, average rating in season x and average rating in season x+1
  • The table should be ordered by the difference between the last two columns, i.e. from seasons with the highest increase to seasons to the highest drop.
  • One option is to run a query in SQL in which you join table seasons from task C with itself and select rows that belong to the same series and successive seasons
  • You can also read the rows of the seasons table in Python, combine information from rows for successive seasons of the same series and create the final report by sorting
  • Submit your code as taskD.py or taskD.sql
  • Submit the resulting table as taskD.txt

The output should start like this (the formatting may differ):

series      season x    rating for x  rating for x+1  
----------  ----------  ------------  ----------------
Sherlock    1           8.825         9.26666666666667
Breaking B  4           9.0           9.375           

When using sql without python, include the following two lines in taskD.sql

.mode column
.headers on

and run your query as sqlite3 series.db < taskD.sql > taskD.txt