1-DAV-202 Data Management 2023/24
Previously 2-INF-185 Data Source Integration
Difference between revisions of "HWcpp"
Line 3: | Line 3: | ||
<!-- /NOTEX --> | <!-- /NOTEX --> | ||
− | + | In both tasks, you should implement all functionalities in two versions: | |
* Pure Python | * Pure Python | ||
* Python interface with C++ implementation | * Python interface with C++ implementation | ||
+ | |||
+ | Then, you should compare the running speed of both implementations. | ||
===Task A=== | ===Task A=== | ||
Line 16: | Line 18: | ||
* Retrieve a document with the given ID | * Retrieve a document with the given ID | ||
* Find all documents that contain the given word | * Find all documents that contain the given word | ||
+ | |||
+ | You should ideally implement this as an inverted index, i.e. for each word keep a list of documents containing it. | ||
This is the recommended interface for your datastructure with its example usage: | This is the recommended interface for your datastructure with its example usage: | ||
Line 50: | Line 54: | ||
Extend you index so that it support following query: | Extend you index so that it support following query: | ||
* Given list of words (W1, W2, ...) find all documents, which contain all of the words (so each document has to contain all of the word from the query). | * Given list of words (W1, W2, ...) find all documents, which contain all of the words (so each document has to contain all of the word from the query). | ||
+ | |||
+ | You should ideally build on top of functionality from previous task. For each word find documents containing it and then find overlap in the results. | ||
This is the recommended interface for your datastructure with its example usage: | This is the recommended interface for your datastructure with its example usage: |
Revision as of 11:03, 12 April 2024
See the lecture
In both tasks, you should implement all functionalities in two versions:
- Pure Python
- Python interface with C++ implementation
Then, you should compare the running speed of both implementations.
Task A
You are given a list of documents. Each document consists of multiple words separated by space. You should design and implement an indexing datastructure that allows the following two operations:
- Add a document to the index (your index should assign it a unique ID)
- Retrieve a document with the given ID
- Find all documents that contain the given word
You should ideally implement this as an inverted index, i.e. for each word keep a list of documents containing it.
This is the recommended interface for your datastructure with its example usage:
from typing import List
class DocumentIndex:
def __init__(self):
"""Initialize what you need here"""
def add_document(self, document:str) -> int:
"""Adds document to index, generates unique id for it"""
def get_document(self, document_id:int) -> str:
"""Returns document for given ID"""
def search(self, word:str) -> List[int]:
"""Returns list of document IDs containing the given word"""
"""Example usage"""
index = DocumentIndex()
index.add_document("this is document one about frog") # Returns 47
index.add_document("this is document two about dog") # Returns 42
index.get_document(42) # Returns "this is document two about dog"
index.search("dog") # Returns [42]
index.search("cat") # Returns []
index.search("this") # Returns [47, 42] or [42, 47] order of documents is not relevant
Task B
Extend you index so that it support following query:
- Given list of words (W1, W2, ...) find all documents, which contain all of the words (so each document has to contain all of the word from the query).
You should ideally build on top of functionality from previous task. For each word find documents containing it and then find overlap in the results.
This is the recommended interface for your datastructure with its example usage:
from typing import List
class DocumentIndex:
def __init__(self):
"""Initialize what you need here"""
def add_document(self, document:str) -> int:
"""Adds document to index, generates unique id for it"""
def get_document(self, document_id:int) -> str:
"""Returns document for given ID"""
def search(self, word:str) -> List[int]:
"""Returns list of document IDs containing the given word"""
def multi_search(self, words: List[str]) -> List[int]:
"""Returns list of document IDs that contains all of the words"""
"""Example usage"""
index = DocumentIndex()
index.add_document("this is document one about frog") # Returns 47
index.add_document("this is document two about dog") # Returns 42
index.get_document(42) # Returns "this is document two about dog"
index.search("dog") # Returns [42]
index.search("cat") # Returns []
index.search("this") # Returns [47, 42] or [42, 47] order of documents is not relevant
index.multi_search(["two", "frog"]) # Returns []
index.multi_search(["two", "dog"]) # Returns [42]