1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Python
Jump to navigation
Jump to search
This page contains a brief Python tutorial for students who are not very familiar with the language.
Other resources:
We will illustrate basic features of Python on several scripts working with files series.csv and episodes.csv introduced in the lecture in SQL.
- In the first two examples we just use standard Python functions for reading files and split lines into columns by split command.
- This does not work well for episodes.csv file where comma sometimes separates columns and sometimes is in quotes within a column. Therefore we use csv module, which is one the the standard Python modules.
- Alternatively, one could import CSV files via more complex libraries, such as Pandas.
- In lecture on SQL similar tasks as these scripts can be done by very short SQL commands. In Pandas we could also achieve similar results using a few commands.
Contents
prog1.py
The first script prints the second column (series title) from series.csv
#! /usr/bin/python3
# open a file for reading
with open('series.csv') as csvfile:
# iterate over lines of the input file
for line in csvfile:
# split a line into columns at commas
columns = line.split(",")
# print the second column
print(columns[1])
- Python uses indentation to delimit blocks. In this example, the with command starts a block and within this block the for command starts another block containing commands columns=... and print. The body of each block is indented several spaces relative to the enclosing block.
- Variables are not declared, but directly used. This program uses variables csvfile, line, columns.
- The open command opens a file (here for reading, but other options are available).
- The with command opens the file, stores the file handle in csvfile variable, executes all commands within its block and finally closes the file.
- The for-loop iterates over all lines in the file, assigning each in variable line and executing the body of the block.
- Method split of the built-in string type str splits the line at every comma and returns a list of strings, one for every column of the table (see also other string methods)
- The final line prints the second column and the end of line character.
prog2.py
The following script prints the list of series of each TV channel.
- For illustration we also separately count the number of the series for each channel, but the count could be also obtained as the length of the list.
#! /usr/bin/python3
from collections import defaultdict
# Create a dictionary in which default value
# for non-existent key is 0 (type int)
# For each channel we will count the series
channel_counts = defaultdict(int)
# Create a dictionary for keeping a list of series per channel
# default value empty list
channel_lists = defaultdict(list)
# open a file and iterate over lines
with open('series.csv') as csvfile:
for line in csvfile:
# strip whitespace (e.g. end of line) from end of line
line = line.rstrip()
# split line into columns, find channel and series names
columns = line.split(",")
channel = columns[2]
series = columns[1]
# increase counter for channel
channel_counts[channel] += 1
# add series to list for the channel
channel_lists[channel].append(series)
# print counts
print("Counts:")
for (channel, count) in channel_counts.items():
print(f"Channel \"{channel}\" has {count} series.")
# print series lists
print("\nLists:")
for channel in channel_lists:
list = ", ".join(channel_lists[channel])
print("Series for channel \"%s\": %s" % (channel,list))
- In this script, we use two dictionaries (maps, associative arrays), both having channel names as keys. Dictionary channel_counts stores the number of series, channel_lists stores the list of series names.
- For simplicity we use a library data structure called defaultdict instead of a plain python dictionary. The reason is easier initialization: keys do not need to be explicitly inserted to the dictionary, but are initialized with a default value at the first access.
- Reading of the input file is similar to the previous script.
- Afterwards we iterate through the keys of both dictionaries and print both the keys and the values.
- We format the output string using f-strings f"..." where expressions in { } are evaluated and substituted to the string. Formatting similar to C-style printf, e.g. print(f"{2/3:.3f}").
- Notice that when we print counts, we iterate through pairs (channel, count) returned by channel_counts.items(), while when we print series, we iterate through keys of the dictionary.
prog3.py
This script finds the episode with the highest number of votes among all episodes.
- We use a library for csv parsing to deal with quotation marks around episode names with commas, such as "Dark Wings, Dark Words".
- This is done by first opening a file and then passing it as an argument to csv.reader, which returns a reader object used to iterate over rows.
#! /usr/bin/python3
import csv
# keep maximum number of votes and its episode
max_votes = 0
max_votes_episode = None
# open a file
with open('episodes.csv') as csvfile:
# create a reader for parsing csv files
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
# iterate over rows already split into columns
for row in reader:
votes = int(row[6])
if votes > max_votes:
max_votes = votes
max_votes_episode = row[1]
# print result
print(f"Maximum votes {max_votes} in episode \"{max_votes_episode}\"")
prog4.py
The following script shows an example of function definition.
- The function reads a whole csv file into a 2d array.
- The rest of the program calls this function twice for each of the two files.
- This could be followed by some further processing of these 2d arrays.
#! /usr/bin/python3
import csv
def read_csv_to_list(filename):
# create empty list
rows = []
# open a file
with open(filename) as csvfile:
# create a reader for parsing csv files
reader = csv.reader(csvfile, delimiter=',', quotechar='"')
# iterate over rows already split into columns
for row in reader:
rows.append(row)
return rows
series = read_csv_to_list('series.csv')
episodes = read_csv_to_list('episodes.csv')
print(f"The number of episodes is {len(episodes)}.")
# further processing of series and episodes...