2-INF-185 Data Source Integration 2016/17

Submit HW10 and HW11 by Tuesday 30.5., 9:00.
Project submission dates:
1st deadline: Sunday 11.6., 22:00
2nd deadline: Wednesday 21.6., 22:00
Both deadlines are regular; the first is intended for students finishing their studies or those who want to complete the course earlier. In both cases, short in-person meetings with the instructors (discussion of the project and finalizing the grade) will take place a few days after submission. Exact days and times will be arranged later. Submit projects the same way as homework, to /submit/projekt.


L08


In this lecture we will use Flask and simple text-processing utilities from scikit-learn.

Flask

Flask is a simple web framework for Python (http://flask.pocoo.org/docs/0.10/quickstart/#a-minimal-application). You can find a sample Flask application at /tasks/hw08/simple_flask. Before running it, change the port number. You can then access your app at vyuka.compbio.fmph.uniba.sk:4247 (with your own port number).
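A minimal application in the spirit of the quickstart might look as follows (the route, message, and port 4247 are just examples; use your own port):

```python
from flask import Flask

app = Flask(__name__)

# one route that returns plain text for the root URL
@app.route('/')
def hello():
    return 'Hello from Flask!'

if __name__ == '__main__':
    # host 0.0.0.0 makes the server reachable from other machines
    app.run(host='0.0.0.0', port=4247)
```

Save it, run it with python, and point your browser at the server's address and port.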

There may be a problem with access to unusual port numbers due to firewall rules. There are at least two ways to circumvent this:

  • Use X forwarding and run a web browser directly on vyuka:
local_machine> ssh vyuka.compbio.fmph.uniba.sk -XC
vyuka> chromium-browser
  • Create a SOCKS proxy to vyuka.compbio.fmph.uniba.sk and set the SOCKS proxy at that port on your local machine. All web traffic then goes through vyuka.compbio.fmph.uniba.sk via an ssh tunnel. To create a SOCKS proxy on local machine port 8000 to vyuka.compbio.fmph.uniba.sk:
local_machine> ssh vyuka.compbio.fmph.uniba.sk -D 8000

(keep the ssh session open while working)

Flask uses the jinja2 templating language (http://jinja.pocoo.org/docs/dev/templates/) for rendering HTML (you could build HTML strings in Python, but it is painful).
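A small illustration of jinja2 syntax, rendered here with the standalone jinja2 package (inside Flask you would normally call render_template on a file in the templates/ directory; the variable names below are made up):

```python
from jinja2 import Template

# a template with a variable substitution and a for-loop
template = Template("""
<h1>Hello {{ name }}!</h1>
<ul>
{% for word in words %}  <li>{{ word }}</li>
{% endfor %}</ul>
""")

html = template.render(name="Ema", words=["ma", "mamu"])
print(html)
```

The same `{{ ... }}` and `{% ... %}` constructs work in templates rendered by Flask.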

Processing text

The main tool for processing text is the CountVectorizer class from scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). It transforms text into a bag of words (for each word we get its count). Example:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(strip_accents='unicode')

texts = [
    "Ema ma mamu.",
    "Zirafa sa vo vani kupe a hneva sa."
]

t = vec.fit_transform(texts).todense()

print(t)
print(vec.vocabulary_)

Useful things

We are working with numpy arrays here (that is the array t in the example above). Numpy arrays also have lots of nice tricks. First, let us create two matrices:

>>> import numpy as np
>>> a = np.array([[1,2,3],[4,5,6]])
>>> b = np.array([[7,8],[9,10],[11,12]])
>>> a
array([[1, 2, 3],
       [4, 5, 6]])
>>> b
array([[ 7,  8],
       [ 9, 10],
       [11, 12]])

We can add these matrices together or multiply them by a number:

>>> 3 * a
array([[ 3,  6,  9],
       [12, 15, 18]])
>>> a + 3 * a
array([[ 4,  8, 12],
       [16, 20, 24]])

We can calculate the sum of all elements in a matrix, or sum along a chosen axis:

>>> np.sum(a)
21
>>> np.sum(a, axis=1)
array([ 6, 15])
>>> np.sum(a, axis=0)
array([5, 7, 9])

There are a lot of other useful functions; check https://docs.scipy.org/doc/numpy-dev/user/quickstart.html.

numpy.argsort can help you get the top words for each user: http://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html#numpy.argsort
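A sketch of how argsort can pick the top words per row; here counts is a made-up word-count matrix of the kind fit_transform produces (rows are users, columns are words), and words is the corresponding vocabulary:

```python
import numpy as np

# hypothetical data: rows are users, columns are words
words = np.array(["ema", "ma", "mamu", "sa", "zirafa"])
counts = np.array([
    [2, 2, 1, 0, 0],
    [0, 0, 0, 2, 1],
])

# argsort sorts each row in ascending order, so reverse each row
# to get the most frequent columns first, then keep the top 2
top = 2
order = np.argsort(counts, axis=1)[:, ::-1][:, :top]
for row in range(len(counts)):
    print("user", row, "->", list(words[order[row]]))
```

With real data you would build words from vec.vocabulary_ (its values are the column indices) instead of writing it by hand.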