1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration

· Grades from marked homeworks are on the server in file /grades/userid.txt
· Dates of project submission and oral exams:
Early: submit project May 24 9:00am, oral exams May 27 1:00pm (limit 5 students).
Otherwise submit project June 11, 9:00am, oral exams June 18 and 21 (estimated 9:00am-1:00pm, schedule will be published before exam).
Sign up for one of the exam days in AIS before June 11.
Remedial exams will take place in the last week of the exam period. Beware, there will not be much time to prepare a better project. Projects should be submitted as homeworks to /submit/project.
· Cloud homework is due on May 20 9:00am.


HWcpp
Latest revision as of 15:06, 2 May 2024

See the lecture

Overview

We will be solving a problem where we are given a list of documents. Each document consists of multiple words separated by space. We will assign each document an ID 0,1,... in the order they appear on input. We want to implement an indexing data structure that allows the following operations:

  • Function add_document adds a document to the index and returns its ID (the next available ID). A document is a list of words (strings).
  • Function search finds all documents that contain a given word, and returns the list of their IDs in increasing order.
  • Function multi_search gets a list of words (W1, W2, ...) and finds all documents that contain all of the words (so each document has to contain all words from the query). It returns the list of document IDs in increasing order.

Note that this functionality could be a base for a very simple text search engine. We could extend it with the option for storing and printing whole documents containing given keywords of interest, but we will not do this today.

In this homework, you are given a Python program that implements these functions using a structure called inverted index.

  • The structure consists of a dictionary in which keys are the words. For each word we keep a list of document IDs in increasing order.
  • Function search simply returns the list from the dictionary.
  • Function multi_search uses search to obtain a list of IDs for each input word. It maintains the intersection of the lists seen so far, repeatedly intersecting it with the next list.
  • Intersection of two sorted lists of IDs is done by a procedure similar to merging two sorted lists in Merge Sort. One integer index is moved along each list so that we always move forward the one that points to a smaller value. If both indices point to the same value, we found an element from the intersection which is placed into the result. As soon as one index reaches the end of the list, we are finished.
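The data structure and intersection procedure described above can be sketched in Python as follows (a minimal illustration written for this page, not the provided doc_finder.py):

```python
from typing import Dict, List


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted ID lists, as in the merge step of Merge Sort."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1                # advance the index pointing to the smaller value
        elif a[i] > b[j]:
            j += 1
        else:                     # same value in both lists: part of the intersection
            out.append(a[i])
            i += 1
            j += 1
    return out


class DocumentIndex:
    """Sketch of the inverted index: a dict from word to sorted list of IDs."""

    def __init__(self) -> None:
        self.index: Dict[str, List[int]] = {}
        self.next_id = 0

    def add_document(self, document: List[str]) -> int:
        doc_id = self.next_id
        self.next_id += 1
        # IDs grow monotonically, so appending keeps each list sorted;
        # set() ensures one entry per word even if the word repeats
        for word in set(document):
            self.index.setdefault(word, []).append(doc_id)
        return doc_id

    def search(self, word: str) -> List[int]:
        return self.index.get(word, [])

    def multi_search(self, words: List[str]) -> List[int]:
        result = self.search(words[0])
        for word in words[1:]:
            result = intersect(result, self.search(word))
        return result


idx = DocumentIndex()
for line in ["this is document zero about frog",
             "this is document one about dog and it is longer"]:
    idx.add_document(line.split())
print(idx.multi_search(["is", "frog"]))  # [0]
```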

The overall program doc_finder.py requires two command-line arguments. The first is a file with documents, one document per line. Words in the document are separated by spaces. The second argument is a file with queries, one query per line. If the line contains a single word, it is passed to search, otherwise to multi_search. For each query, the program prints the IDs of the found documents separated by spaces. If no documents were found, it prints a dash.

Here is an example of tiny debugging inputs and outputs:

==> tiny_documents.txt <==
this is document zero about frog
this is document one about dog and it is longer

==> tiny_queries.txt <==
dog
cat
is
is frog
dog frog
cat frog
is this

==> tiny_answers.txt <==
1
-
0 1
0
-
-
0 1
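The per-query behaviour (a single word goes to search, several words to multi_search, an empty result prints a dash) can be sketched with a small stand-alone function. For brevity this illustration uses a plain dictionary and library set intersection rather than the provided implementation; it reproduces the tiny example above:

```python
def answer(index, line):
    """Return the output for one query line: IDs separated by spaces, or a dash."""
    words = line.split()
    if len(words) == 1:
        ids = index.get(words[0], [])
    else:
        # documents that contain every word of the query
        id_sets = [set(index.get(w, [])) for w in words]
        ids = sorted(set.intersection(*id_sets))
    return " ".join(map(str, ids)) if ids else "-"


# a plain dictionary-based inverted index built from the tiny documents
docs = ["this is document zero about frog",
        "this is document one about dog and it is longer"]
index = {}
for doc_id, doc in enumerate(docs):
    for word in set(doc.split()):
        index.setdefault(word, []).append(doc_id)

queries = ["dog", "cat", "is", "is frog", "dog frog", "cat frog", "is this"]
for q in queries:
    print(answer(index, q))   # prints the seven lines of tiny_answers.txt
```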

Files

Tiny debugging input and correct output:

  • /tasks/cpp/tiny_documents.txt a file with documents
  • /tasks/cpp/tiny_queries.txt a file with queries (both single and multi-word)
  • /tasks/cpp/tiny_answers.txt correct answers for the queries

Larger inputs:

  • /tasks/cpp/documents.txt a file with documents
  • /tasks/cpp/queries.txt a file with single-word queries
  • /tasks/cpp/multiqueries.txt a file with multi-word queries

Program in Python:

  • /tasks/cpp/doc_finder.py

You can run the program as follows:

# run the program on the tiny inputs
python3 /tasks/cpp/doc_finder.py /tasks/cpp/tiny_documents.txt /tasks/cpp/tiny_queries.txt > test_answers.txt
# check whether the answer is correct
diff /tasks/cpp/tiny_answers.txt test_answers.txt

Task A

The functions listed above are implemented in class DocumentIndex in file doc_finder.py. Reimplement this class in C++ and use it in Python.

  • Your Python script should be named doc_finder_cpp.py. Function main can remain almost the same as in doc_finder.py, except for a slightly different initialization of the index variable using the C++ class. Omit the Python implementation of the class from the file.
  • The C++ class should be placed in file DocumentIndex.cc.
  • Use the same algorithm in C++ as in Python. We recommend the STL containers unordered_map and vector.
  • Submit files doc_finder_cpp.py and DocumentIndex.cc.
  • Your implementation should successfully compile and run on the tiny inputs and produce correct output (empty diff) using the commands below:
# compile C++
c++ -O3 -Wall -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) DocumentIndex.cc -o DocumentIndex$(python3-config --extension-suffix)
# run the program on the tiny inputs
python3 doc_finder_cpp.py /tasks/cpp/tiny_documents.txt /tasks/cpp/tiny_queries.txt > test_answer.txt
# check whether the answer is correct
diff /tasks/cpp/tiny_answers.txt test_answer.txt

Task B

  • Compare the running time and memory required by doc_finder.py and doc_finder_cpp.py.
  • Use the provided list of documents /tasks/cpp/documents.txt and consider three files with queries:
    • Empty file with queries to measure how long it takes to build the index
    • /tasks/cpp/queries.txt to measure the speed of search
    • /tasks/cpp/multiqueries.txt to measure the speed of multi_search
  • You can measure time and memory using /usr/bin/time -v, such as:
/usr/bin/time -v python3 doc_finder.py /tasks/cpp/documents.txt /tasks/cpp/queries.txt > tmp-output.txt 2> time-measurement.txt
  • The interesting rows in the measurements are Elapsed (wall clock) time and Maximum resident set size; also check Percent of CPU this job got.
  • Run each program several times (at least 3x) to see how the time and memory fluctuate.
  • Report the results in the protocol in some easy-to-read form. Consider using summary statistics such as mean and standard deviation. Optionally, you can show the results in a plot or a table placed in a png or pdf file. Do not forget to discuss the results.
  • Also discuss your conclusions from the measured values in the protocol.
  • Submit the protocol.

Notes:

  • It might be a good idea to write some script / makefile for running programs and parsing the time and memory values. If you do so, submit your script as well.
  • You can run the experiments on your computer or on vyuka. If you run on vyuka, make sure the server is not too busy (you should get a high % of CPU, close to 100%).
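A parsing script of the kind suggested in the notes might start like this. This is only a sketch: the strings in runs imitate the two relevant lines of GNU time -v output, and the numbers in them are invented.

```python
import re
import statistics

# relevant lines from three /usr/bin/time -v runs (the numbers are invented)
runs = [
    "Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.31\n"
    "Maximum resident set size (kbytes): 51234\n",
    "Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.47\n"
    "Maximum resident set size (kbytes): 51240\n",
    "Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.39\n"
    "Maximum resident set size (kbytes): 51198\n",
]


def to_seconds(value: str) -> float:
    """Convert 'h:mm:ss' or 'm:ss.ss' to seconds."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_run(text: str):
    """Extract wall-clock seconds and peak memory (kB) from time -v output."""
    wall = re.search(r"Elapsed \(wall clock\) time.*: (\S+)", text).group(1)
    mem = re.search(r"Maximum resident set size \(kbytes\): (\d+)", text).group(1)
    return to_seconds(wall), int(mem)


times = [parse_run(r)[0] for r in runs]
mems = [parse_run(r)[1] for r in runs]
print(f"wall clock: mean {statistics.mean(times):.2f} s, "
      f"sd {statistics.stdev(times):.2f} s")
print(f"max RSS:    mean {statistics.mean(mems):.0f} kB")
```

In a real script the runs would of course be read from the files produced by redirecting the stderr of /usr/bin/time -v.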

Task C

In this task you choose one of the following two options for extending your work from tasks A and B.

Option 1

Try a different implementation of the data structure in both Python and C++. One example is using sets instead of lists of document IDs for each word and then using the set intersection provided by the language libraries instead of the manual merge algorithm. You can also come up with a different approach, but try to keep it sensible. Compare the time and memory of your new implementations to the results from Task B.

  • Submit your implementations.
  • In the protocol describe the approach you chose in your implementation.
  • In the protocol, also give the results of the time and memory measurements and discuss them (compare the two new versions with each other, but also with the two from Task B).
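For illustration, the set-based variant mentioned in Option 1 might look like this in Python (one possible design, not a prescribed solution; a C++ counterpart could use std::unordered_set similarly):

```python
from typing import Dict, List, Set


class SetDocumentIndex:
    """Variant storing a set of document IDs per word instead of a sorted list."""

    def __init__(self) -> None:
        self.index: Dict[str, Set[int]] = {}
        self.next_id = 0

    def add_document(self, document: List[str]) -> int:
        doc_id = self.next_id
        self.next_id += 1
        for word in document:
            self.index.setdefault(word, set()).add(doc_id)
        return doc_id

    def search(self, word: str) -> List[int]:
        # sets are unordered, so sort before returning
        return sorted(self.index.get(word, set()))

    def multi_search(self, words: List[str]) -> List[int]:
        id_sets = [self.index.get(w, set()) for w in words]
        # library set intersection replaces the manual merge procedure
        return sorted(set.intersection(*id_sets)) if id_sets else []
```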

Option 2

Do a more extensive profiling of the two implementations you already have. Here are two suggestions, but you can come up with your own options as well:

  • Try using several sizes of the set of documents and see how the time of building the structure and its memory footprint changes. (You can use subsets or multiple copies of the existing documents.)
  • Try comparing queries with frequent words and those with less frequent words. How does the size of the output or intermediate lists in the multiquery influence the running time?
  • Do not overload the vyuka server with your experiments, and delete large input and output files when they are no longer needed.
  • Create some plots showing the obtained results. Submit figures with these plots (png or pdf format) and discuss them in the protocol.
  • Also describe in the protocol what commands you used to create your input files. If you used some scripts for this purpose, submit them.
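For the first suggestion, inputs of several sizes can be produced by truncating or repeating the existing document list; a small helper sketch (the function name and example data are illustrative):

```python
from typing import List


def scale_documents(lines: List[str], factor: float) -> List[str]:
    """Return a document list scaled to roughly len(lines) * factor entries,
    by truncating (factor < 1) or cycling through copies (factor > 1)."""
    n = max(1, int(len(lines) * factor))
    repeated = lines * (n // len(lines) + 1)  # enough whole copies
    return repeated[:n]


docs = ["doc one", "doc two", "doc three", "doc four"]
print(len(scale_documents(docs, 0.5)))  # 2
print(len(scale_documents(docs, 3)))    # 12
```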