1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

Materials · Introduction · Rules · Contact
· Grades from marked homeworks are on the server in file /grades/userid.txt


Python

From MAD
Jump to navigation Jump to search

This page contains a brief Python tutorial for students who are not very familiar with the language.

Other resources:

We will illustrate basic features of Python on several scripts working with files series.csv and episodes.csv introduced in the lecture in SQL.

  • In the first two examples we just use standard Python functions for reading files and split lines into columns by split command.
  • This does not work well for episodes.csv file where comma sometimes separates columns and sometimes is in quotes within a column. Therefore we use csv module, which is one the the standard Python modules.
  • Alternatively, one could import CSV files via more complex libraries, such as Pandas.
  • In lecture on SQL similar tasks as these scripts can be done by very short SQL commands. In Pandas we could also achieve similar results using a few commands.

prog1.py

The first script prints the second column (series title) from series.csv

#! /usr/bin/python3

# open a file for reading
with open('series.csv') as csvfile:
    # iterate over lines of the input file
    for line in csvfile:
        # split a line into columns at commas
        columns = line.split(",")
        # print the second column
        print(columns[1])
  • Python uses indentation to delimit blocks. In this example, the with command starts a block and within this block the for command starts another block containing commands columns=... and print. The body of each block is indented several spaces relative to the enclosing block.
  • Variables are not declared, but directly used. This program uses variables csvfile, line, columns.
  • The open command opens a file (here for reading, but other options are available).
  • The with command opens the file, stores the file handle in csvfile variable, executes all commands within its block and finally closes the file.
  • The for-loop iterates over all lines in the file, assigning each in variable line and executing the body of the block.
  • Method split of the built-in string type str splits the line at every comma and returns a list of strings, one for every column of the table (see also other string methods)
  • The final line prints the second column and the end of line character.

prog2.py

The following script prints the list of series of each TV channel.

  • For illustration we also separately count the number of the series for each channel, but the count could be also obtained as the length of the list.
#! /usr/bin/python3
from collections import defaultdict

# Create a dictionary in which default value
# for non-existent key is 0 (type int)
# For each channel we will count the series
channel_counts = defaultdict(int)

# Create a dictionary for keeping a list of series per channel
# default value empty list
channel_lists = defaultdict(list)

# open a file and iterate over lines
with open('series.csv') as csvfile:
    for line in csvfile:
        # strip whitespace (e.g. end of line) from end of line
        line = line.rstrip()
        # split line into columns, find channel and series names
        columns = line.split(",")
        channel = columns[2]
        series = columns[1]
        # increase counter for channel
        channel_counts[channel] += 1
        # add series to list for the channel
        channel_lists[channel].append(series)

# print counts
print("Counts:")
for (channel, count) in channel_counts.items():
    print(f"Channel \"{channel}\" has {count} series.")

# print series lists
print("\nLists:")
for channel in channel_lists:
    list = ", ".join(channel_lists[channel]) 
    print("Series for channel \"%s\": %s" % (channel,list))
  • In this script, we use two dictionaries (maps, associative arrays), both having channel names as keys. Dictionary channel_counts stores the number of series, channel_lists stores the list of series names.
  • For simplicity we use a library data structure called defaultdict instead of a plain python dictionary. The reason is easier initialization: keys do not need to be explicitly inserted to the dictionary, but are initialized with a default value at the first access.
  • Reading of the input file is similar to the previous script.
  • Afterwards we iterate through the keys of both dictionaries and print both the keys and the values.
  • We format the output string using f-strings f"..." where expressions in { } are evaluated and substituted to the string. Formatting similar to C-style printf, e.g. print(f"{2/3:.3f}").
  • Notice that when we print counts, we iterate through pairs (channel, count) returned by channel_counts.items(), while when we print series, we iterate through keys of the dictionary.

prog3.py

This script finds the episode with the highest number of votes among all episodes.

  • We use a library for csv parsing to deal with quotation marks around episode names with commas, such as "Dark Wings, Dark Words".
  • This is done by first opening a file and then passing it as an argument to csv.reader, which returns a reader object used to iterate over rows.
#! /usr/bin/python3
import csv

# keep maximum number of votes and its episode
max_votes = 0
max_votes_episode = None

# open a file
with open('episodes.csv') as csvfile:
    # create a reader for parsing csv files
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    # iterate over rows already split into columns
    for row in reader:
        votes = int(row[6])
        if votes > max_votes:
            max_votes = votes
            max_votes_episode = row[1]
        
# print result
print(f"Maximum votes {max_votes} in episode \"{max_votes_episode}\"")

prog4.py

The following script shows an example of function definition.

  • The function reads a whole csv file into a 2d array.
  • The rest of the program calls this function twice for each of the two files.
  • This could be followed by some further processing of these 2d arrays.
#! /usr/bin/python3
import csv

def read_csv_to_list(filename):
    # create empty list
    rows = []
    # open a file
    with open(filename) as csvfile:
        # create a reader for parsing csv files
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        # iterate over rows already split into columns
        for row in reader:
            rows.append(row)
    return rows

series = read_csv_to_list('series.csv')
episodes = read_csv_to_list('episodes.csv')
print(f"The number of episodes is {len(episodes)}.")
# further processing of series and episodes...