tech-userlevel archive


Learning words from man pages

Hi All,

I came across an interesting machine learning paper from Google[1]
that describes an efficient way to learn vector representations of the
words in a corpus. Such representations are generally called word
embeddings, and the authors call their method word2vec.

It is a shallow, two-layer neural network that takes a corpus as input
and produces a set of word vectors as output. Each vector represents
one word of the corpus in a vector space, and words with similar
meanings lie closer to each other in that space.
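
To make "nearer in that space" concrete, here is a minimal sketch of a
nearest-neighbour lookup using cosine similarity, a common closeness
measure for word vectors. The 3-dimensional vectors and the names
TOY_VECTORS, cosine and nearest are invented for illustration; they
are not taken from the trained man page model.

```python
import math

# Toy vectors, made up for this example. Real word2vec vectors
# usually have a hundred or more dimensions.
TOY_VECTORS = {
    "bug":     [0.9, 0.1, 0.0],
    "defect":  [0.8, 0.2, 0.1],
    "problem": [0.7, 0.3, 0.2],
    "socket":  [0.0, 0.9, 0.4],
}

def cosine(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, topn=3):
    """Return the topn (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

for other, score in nearest("bug", TOY_VECTORS):
    print(f"{other}: {score:.3f}")
```

With these toy values, "defect" and "problem" rank well above "socket"
for the query "bug".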

There are two model architectures for training:
1. Continuous bag of words (CBOW): given the words surrounding a
position (within a fixed window), predict the word at that position.
The order of the context words is not considered.
2. Skip-gram: the reverse. Given a word w1, predict the words around
it, i.e. the probability of a word w2 appearing within a window of
w1.
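
A minimal sketch of how the two architectures turn a token stream into
training examples: CBOW pairs a window of context words with the
center word to be predicted, while skip-gram pairs the center word
with each context word in turn. The function names and the window size
are my own choices for illustration, not taken from the paper's code,
and the actual network training is left out.

```python
def cbow_examples(tokens, window=2):
    """(context words, center word) examples: predict center from context."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

def skipgram_pairs(tokens, window=2):
    """(center word, context word) pairs: predict context from center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["list", "directory", "contents"]
print(cbow_examples(tokens, window=1))
print(skipgram_pairs(tokens, window=1))
```

For the middle word "directory" with a window of 1, CBOW produces the
single example (["list", "contents"], "directory"), and skip-gram
produces the two pairs ("directory", "list") and
("directory", "contents").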

They show some interesting properties of the resulting vectors. For
example, "France" and "Italy" are close to each other in the model
that they trained. Another interesting observation is that vector
arithmetic is meaningful here; for example, they show that

vector(king) - vector(man) + vector(woman)

lands closest to vector(queen).
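
In practice the result of such arithmetic is not exactly equal to any
word's vector; the answer is the word whose vector is closest to the
result, usually excluding the query words themselves. A minimal sketch
of that lookup follows. The 2-dimensional vectors and the names
TOY_VECTORS, cosine and analogy are invented for illustration and do
not come from a trained model.

```python
import math

# Toy vectors, made up for this example: the first component loosely
# stands for "royalty", the second for "maleness".
TOY_VECTORS = {
    "king":   [0.9, 0.8],
    "queen":  [0.9, 0.2],
    "man":    [0.1, 0.8],
    "woman":  [0.1, 0.2],
    "prince": [0.8, 0.7],
}

def cosine(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z
              for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Skip the query words; otherwise one of them often wins.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", TOY_VECTORS))
```

With these toy values the lookup returns "queen", ahead of "prince".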

This technique is becoming widely popular and has applications in
areas such as search, question answering, and summarization. I've
trained it on our man page corpus (plus some man pages from pkgsrc)
and put a demo here:

Some of the interesting queries that I found:
bug: gives <defect, problem, undetected, lurk etc> in the top results
man: gives <mdoc, html, overview, readme>
netbsd: shows <freebsd, openbsd, ultrix, linux>
christos: <zoulas, cornell> in the top two

Give it a try and let me know how you like it. :)

Coming Soon: I still need to implement the interface for doing vector
addition and subtraction.

BUGS: Use single word queries in non-plural form for best experience ;)


