tech-userlevel archive


Re: Learning words from man pages



On Sun, Jun 12, 2016 at 4:50 AM, Abhinav Upadhyay
<er.abhinav.upadhyay%gmail.com@localhost> wrote:
> Hi All,
>
> I came across an interesting paper from Google on machine learning[1],
> in which they came up with an efficient representation for words from a
> corpus. These representations are called word embeddings in general,
> and they call their method word2vec.
>
> It is a two-layer neural network which, given a corpus as input,
> produces a set of word vectors as its output. These vectors represent
> each word in the corpus in a vector space, where words with similar
> semantics lie nearer to each other in that space.
>
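
To make "nearer to each other" concrete, here is a minimal sketch of a
nearest-word lookup. It assumes cosine similarity as the measure of
closeness, and the 3-dimensional vectors are made up for illustration;
they are not taken from the man page model. The names cosine, nearest
and toy_vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only; a trained model would
# typically use a few hundred dimensions.
toy_vectors = {
    "bug": [0.9, 0.1, 0.0],
    "defect": [0.8, 0.2, 0.1],
    "problem": [0.7, 0.3, 0.0],
    "netbsd": [0.0, 0.1, 0.9],
    "freebsd": [0.1, 0.0, 0.8],
}


def nearest(word, vectors, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    query = vectors[word]
    scored = [
        (cosine(query, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]


print(nearest("bug", toy_vectors))
print(nearest("netbsd", toy_vectors, k=1))
```

With these toy vectors, "bug" lands next to "defect" and "problem",
and "netbsd" next to "freebsd".
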
> There are two methods of training the model:
> 1. Continuous bag of words (CBOW): the model predicts a word from the
> words surrounding it; the ordering of those context words is not
> considered.
> 2. Skip-gram: the reverse; given a word w1, the model predicts the
> probability of a word w2 appearing in the window around it.
>
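
As a rough illustration of how the two methods in the paper [1] differ,
the sketch below builds the training examples each one would see for a
short sentence. CBOW predicts the current word from its surrounding
words, while skip-gram predicts the surrounding words from the current
word. The function names and the window size are assumptions made for
this example; this is not the actual word2vec code, and it leaves out
the network itself.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of tokens[i], excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: (context words, target word) pairs."""
    return [
        (context_window(tokens, i, window), tokens[i])
        for i in range(len(tokens))
    ]


def skipgram_examples(tokens, window=2):
    """Skip-gram: one (input word, context word) pair per neighbour."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]


tokens = "list directory contents".split()

for context, target in cbow_examples(tokens, window=1):
    print("CBOW      ", context, "->", target)

for word, ctx in skipgram_examples(tokens, window=1):
    print("skip-gram ", word, "->", ctx)
```
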
> They show some interesting consequences of this. For example,
> "France" and "Italy" are close to each other in the model that they
> trained. Another interesting observation is that vector algebra
> works on these representations; for example, they show that:
>
> vector(king) - vector(man) + vector(woman) is closest to vector(queen).
>
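
The vector arithmetic can be sketched as follows: compute
vector(a) - vector(b) + vector(c), then pick the remaining word whose
vector is closest to the result by cosine similarity. The vectors
below are invented so that the example works out; real embeddings only
satisfy such relations approximately. The names analogy and
toy_vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Word closest to vector(a) - vector(b) + vector(c).

    The three input words are excluded from the candidates.
    """
    target = [
        x - y + z
        for x, y, z in zip(vectors[a], vectors[b], vectors[c])
    ]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Made-up vectors; the dimensions loosely stand for
# (royalty, male, female) purely for illustration.
toy_vectors = {
    "king": [0.9, 0.9, 0.1],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.2],
    "apple": [0.0, 0.2, 0.2],
}

print(analogy("king", "man", "woman", toy_vectors))
```
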
> This technique is becoming widely popular and has applications in
> areas like search, question answering, and summarization. I've trained
> this on our man page corpus data (plus some man pages from pkgsrc) and
> put a demo here: https://man-k.org/words/
>
> Some of the interesting queries that I found:
> bug: gives <defect, problem, undetected, lurk etc> in the top results
> man: gives <mdoc, html, overview, readme>
> netbsd: shows <freebsd, openbsd, ultrix, linux>
> christos: <zoulas, cornell> in the top two
>
> Give it a try and let me know how you like it. :)
>
> Coming Soon: I still need to implement the interface for doing vector
> addition and subtraction.
>
> BUGS: Use single word queries in non-plural form for best experience ;)
>
> [1] http://arxiv.org/pdf/1301.3781.pdf
>

Just fixed the internal server error. I guess I broke it right after
sending the email :(

-
Abhinav

