tech-userlevel archive


Re: Learning words from man pages

On Sun, Jun 12, 2016 at 4:50 AM, Abhinav Upadhyay
<> wrote:
> Hi All,
> I came across an interesting paper from Google on machine learning[1],
> in which they present an efficient way to learn representations of the
> words in a corpus. Such representations are generally called word
> embeddings, and they call their method word2vec.
> It is a two-layer neural network which, given a corpus as input,
> produces a set of word vectors as its output. These vectors place
> each word of the corpus in a vector space, where words with similar
> semantics lie nearer to each other.
> There are two models for training on the data:
> 1. Continuous bag of words (CBOW): the model predicts a word from the
> words surrounding it. The ordering of those context words is not
> considered.
> 2. Skip-gram: the reverse. Given a word w1, the model predicts the
> probability of a word w2 appearing within a window around it, giving
> more weight to nearby words.
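
To make the difference between the two models concrete, here is a rough
sketch in Python of the training examples each one is built from,
following the description in the word2vec paper: CBOW predicts a word
from its surrounding context, while skip-gram predicts the surrounding
words from the word. The function names and the fixed `window` parameter
are made up for illustration; this is not the word2vec source, which
among other things varies the window size randomly.

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target): CBOW predicts target from context."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        context = tokens[lo:i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target


def skipgram_pairs(tokens, window=2):
    """Yield (center, context_word): skip-gram predicts context from center."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        for ctx in tokens[lo:i] + tokens[i + 1:i + 1 + window]:
            yield center, ctx


# Hypothetical sentence, tokenized by whitespace for simplicity.
sentence = "the ls utility lists directory contents".split()
for pair in cbow_pairs(sentence, window=1):
    print("cbow     ", pair)
for pair in skipgram_pairs(sentence, window=1):
    print("skip-gram", pair)
```

CBOW produces one example per word, with the whole context as input;
skip-gram produces one example per (word, neighbour) pair.
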
> They have shown some interesting properties of these vectors. For
> example, "France" and "Italy" are close to each other in the model
> that they trained. Another interesting observation is that vector
> algebra works on them; for example, they show that
> vector(king) - vector(man) + vector(woman) is closest to vector(queen).
> This technique is becoming widely popular and has applications in
> areas like search, question answering, and summarization. I've trained
> it on our man page corpus (plus some man pages from pkgsrc) and
> put a demo here:
> Some of the interesting queries that I found:
> bug: gives <defect, problem, undetected, lurk etc> in the top results
> man: gives <mdoc, html, overview, readme>
> netbsd: shows <freebsd, openbsd, ultrix, linux>
> christos: <zoulas, cornell> in the top two
> Give it a try and let me know how you like it. :)
> Coming Soon: I still need to implement the interface for doing vector
> addition and subtraction.
> BUGS: Use single word queries in non-plural form for best experience ;)
> [1]

Just fixed the internal server error; I guess I broke it right after
sending the email :(
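
For anyone curious what the similarity queries and the vector addition
and subtraction mentioned above boil down to, here is a minimal sketch in
Python. It ranks words by cosine similarity, which is the usual choice
for word vectors. The 2-d vectors are made up for illustration and are
not taken from the man page model, and the function names are
hypothetical; the demo's actual code may look quite different.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, query_vec, exclude=(), k=3):
    """Return the k words whose vectors are most similar to query_vec."""
    scored = [(cosine(query_vec, vec), word)
              for word, vec in vectors.items() if word not in exclude]
    return [word for _, word in sorted(scored, reverse=True)[:k]]


def analogy(vectors, a, b, c, k=1):
    """Words nearest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z
              for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The input words are excluded, otherwise they tend to rank first.
    return nearest(vectors, target, exclude={a, b, c}, k=k)


# Toy vectors, invented for this example only.
TOY = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.2, 1.0],
    "woman": [0.2, -1.0],
    "bug":   [-1.0, 0.1],
}

print(nearest(TOY, TOY["man"], exclude={"man"}, k=2))
print(analogy(TOY, "king", "man", "woman"))
```

A single-word query is just nearest() on that word's own vector, so the
arithmetic interface would only need to build the target vector first.
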

