2-INF-185 Integrácia dátových zdrojov 2017/18

Materiály · Úvod · Pravidlá · Kontakt
Body z už opravených úloh nájdete na serveri v /grades/userid.txt
Dátumy odovzdania projektov:
1. termín: nedeľa 4.6. 22:00
2. termín: streda 20.6. 22:00
Oba termíny sú riadne, prvý je určený pre študentov, čo chcú mať predmet ukončený skôr. V oboch prípadoch sa pár dní po odvzdaní budú konať krátke osobné stretnutia s vyučujúcimi (diskusia k projektu a uzatváranie známky). Presné dni a časy dohodneme neskôr. Projekty odovzdajte podobne ako domáce úlohy do /submit/projekt


From IDZ
Jump to: navigation, search


In this lecture we will use Flask and simple text processing utilities from ScikitLearn.


Flask is simple web server for python (http://flask.pocoo.org/docs/0.10/quickstart/#a-minimal-application) You can find sample flask application at /tasks/hw07/simple_flask. Before running change the port number. You can then access your app at vyuka.compbio.fmph.uniba.sk:4247 (change port number).

There may be problem with access to strange port numbers due to firewalling rules. There are at least two ways to circumvent this:

  • Use X forwarding and run web browser directly from vyuka
local_machine> ssh vyuka.compbio.fmph.uniba.sk -XC
vyuka> chromium-browser
  • Create SOCKS proxy to vyuka.compbio.fmph.uniba.sk and set SOCKS proxy at that port on your local machine. Then all web traffic goes through vyuka.compbio.fmph.uniba.sk via ssh tunnel. To create SOCKS proxy server on local machine port 8000 to vyuka.compbio.fmph.uniba.sk:
local_machine> ssh vyuka.compbio.fmph.uniba.sk -D 8000

(keep ssh session open while working)

Flask uses jinja2 (http://jinja.pocoo.org/docs/dev/templates/) templating language for showing html (you can use strings in python but it is painful).

Processing text

Main tool for processing text is CountVectorizer class from ScikitLearn (http://scikit--learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). It transforms text into bag of words (for each word we get counts). Example:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(strip_accents='unicode')

texts = [
 "Ema ma mamu.",
 "Zirafa sa vo vani kupe a hneva sa."

t = vec.fit_transform(texts).todense()



Useful things

We are working with numpy arrays here (that's array t in example above) Numpy arrays has also lots of nice tricks. First lets create two matrices:

>>> import numpy as np
>>> a = np.array([[1,2,3],[4,5,6]])
>>> b = np.array([[7,8],[9,10],[11,12]])
>>> a
array([[1, 2, 3],
       [4, 5, 6]])
>>> b
array([[ 7,  8],
       [ 9, 10],
       [11, 12]])

We can sum this matrices or multiply them by some number:

>>> 3 * a
array([[ 3,  6,  9],
       [12, 15, 18]])
>>> a + 3 * a
array([[ 4,  8, 12],
       [16, 20, 24]])

We can calculate sum of elements in each matrix, or sum by some axis:

>>> np.sum(a)
>>> np.sum(a, axis=1)
array([ 6, 15])
>>> np.sum(a, axis=0)
array([5, 7, 9])

There is a lot other useful functions check https://docs.scipy.org/doc/numpy-dev/user/quickstart.html.

This can help you get top words for each user: http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html#numpy.argsort