Word2Vec is an iteration-based Natural Language Processing (NLP) framework for learning word vectors, developed by Tomas Mikolov and colleagues. For each word position $t = 1, \dots, T$ in a large corpus of text we define a fixed window of size $m$ and the likelihood function:

$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$

We define the objective function as the average negative log-likelihood:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$

Say an outside word $o$ is more likely to appear together with a center word $c$ than with another word $c'$; we want to define a probability measure such that:

$$P(o \mid c) > P(o \mid c')$$

We achieve this by creating two word vectors, $v_w$ (center) and $u_w$ (outside), for each word $w$, and using the dot product to represent the similarity of two words. We then use the softmax function to transform the dot products into a probability measure:

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

Note that the softmax function amplifies the probabilities of the largest values while still assigning some probability to smaller values.
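As a small sketch of this step (using NumPy with a toy vocabulary; the sizes and random vectors here are made up for illustration):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
V, d = 5, 4                   # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))   # outside vectors u_w, one row per word
v_c = rng.normal(size=d)      # center vector for word c

p = softmax(U @ v_c)          # P(o | c) for every word o in the vocabulary
print(p, p.sum())             # probabilities are positive and sum to 1
```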

Gradient Calculation

We can now derive the gradient of the objective function. First look at the derivative with respect to the center vector $v_c$:

$$\frac{\partial J(\theta)}{\partial v_c} = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \frac{\partial}{\partial v_c} \log P(w_{t+j} \mid w_t; \theta)$$

Looking at the derivative inside the double summation, write $\log P(o \mid c) = u_o^\top v_c - \log \sum_{w \in V} \exp(u_w^\top v_c)$ and use the fact that:

$$\frac{\partial}{\partial v_c} u_o^\top v_c = u_o, \qquad \frac{\partial}{\partial v_c} \log \sum_{w \in V} \exp(u_w^\top v_c) = \sum_{x \in V} P(x \mid c)\, u_x$$

Therefore,

$$\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{x \in V} P(x \mid c)\, u_x$$

Computing the other derivative, w.r.t. the outside vectors $u_w$:

$$\frac{\partial}{\partial u_w} \log P(o \mid c) = \left(\mathbf{1}\{w = o\} - P(w \mid c)\right) v_c$$
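A quick numerical check of the center-vector gradient (a sketch with a random toy vocabulary, not part of the original derivation): the analytic expression $u_o - \sum_x P(x \mid c)\, u_x$ should match a finite-difference estimate.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def log_p(o, v_c, U):
    # log P(o | c) under the softmax model
    return np.log(softmax(U @ v_c))[o]

rng = np.random.default_rng(1)
V, d = 6, 3
U = rng.normal(size=(V, d))
v_c = rng.normal(size=d)
o = 2

# Analytic gradient: u_o - sum_x P(x|c) u_x
p = softmax(U @ v_c)
analytic = U[o] - p @ U

# Central finite-difference gradient for comparison
eps = 1e-6
numeric = np.array([
    (log_p(o, v_c + eps * np.eye(d)[i], U)
     - log_p(o, v_c - eps * np.eye(d)[i], U)) / (2 * eps)
    for i in range(d)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```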

We use two vectors for each word so that the optimization is easier; eventually we average the two vectors for a given word. We can now use gradient descent to update the model parameters $\theta$.

Since $\nabla_\theta J(\theta)$ can be expensive to calculate over the entire corpus of text, we use stochastic gradient descent, which updates $\theta$ with randomly selected samples of text.
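A minimal sketch of one stochastic update on a single sampled (center, outside) pair, using the gradients of the full softmax loss (toy random data; real implementations such as gensim's use negative sampling and optimized code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(2)
V, d, lr = 8, 5, 0.05
U = rng.normal(scale=0.1, size=(V, d))   # outside vectors u_w
W = rng.normal(scale=0.1, size=(V, d))   # center vectors v_w
c, o = 3, 5                              # one sampled (center, outside) pair

p = softmax(U @ W[c])
loss_before = -np.log(p[o])

# Gradients of -log P(o|c) from the softmax model
grad_v = -(U[o] - p @ U)                        # w.r.t. v_c
grad_U = np.outer(p, W[c]); grad_U[o] -= W[c]   # w.r.t. every u_w

W[c] -= lr * grad_v
U -= lr * grad_U

loss_after = -np.log(softmax(U @ W[c])[o])
print(loss_after < loss_before)  # the sampled pair became more likely
```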

Example

requests Requests is a library for making HTTP requests in Python. HTTP functions as a request–response protocol in the client–server computing model. A web browser, for example, may be the client and an application running on a computer hosting a website may be the server.

  • The client submits an HTTP request message to the server.
  • The server, which provides resources such as HTML files and other content, or performs other functions on behalf of the client, returns a response message to the client. The response contains completion status information about the request and may also contain requested content in its message body.

A Uniform Resource Locator (URL) is a reference to a web resource; it typically includes a protocol (http), a hostname (www.website.com), and a file name (index.html).
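For example, the standard library can split a URL into these parts (a small illustration, not from the original post):

```python
from urllib.parse import urlparse

parts = urlparse('http://www.website.com/index.html')
print(parts.scheme)   # 'http'            -- the protocol
print(parts.netloc)   # 'www.website.com' -- the hostname
print(parts.path)     # '/index.html'     -- the file name
```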

bs4 Beautiful Soup is a Python library for pulling data out of HTML and XML files, and it can work with a parser to provide functionality for navigating, searching, and modifying the parse tree. HTML and XML are both markup languages; XML was designed for storing and transporting data, while HTML focuses on displaying it.

re Provides regular expression matching operations.

nltk The Natural Language Toolkit is a Python platform for working with human language data. It provides interfaces to lexical resources as well as a suite of text processing libraries.

gensim Gensim is an open-source library for unsupervised topic modeling and natural language processing, using modern statistical machine learning.

import requests
import bs4
import re
import nltk
from nltk.corpus import stopwords   # first run may require nltk.download('stopwords') and nltk.download('punkt')
from gensim.models import Word2Vec

# Fetch the Wikipedia page and collect the text of all <p> elements
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
response = requests.get(url).content
soup = bs4.BeautifulSoup(response, "lxml").find_all('p')
raw_data = "".join([s.text for s in soup])

Perform data cleaning and retain only key words from the parsed data.

data = raw_data.lower()
data = re.sub('[^a-zA-Z]', ' ', data)   # strip everything except letters
data = re.sub(r'\s+', ' ', data)        # collapse repeated whitespace
words = [nltk.word_tokenize(sentence) for sentence in nltk.sent_tokenize(data)]
words = [[w for w in sentence if w not in stopwords.words('english')] for sentence in words]

Train Word2Vec model.

model = Word2Vec(words, min_count=2)

View the word vector for the word ‘python’:

model.wv['python']

array([ 0.00272543,  0.00044483, -0.002916  , -0.00091269, -0.0028646 ,
0.00312438, -0.00202833, 0.00418179, 0.00512605, 0.00398485,
0.00383619, -0.00527396, 0.00136241, 0.00048289, 0.00393969,
0.00363039, 0.00516506, 0.00065952, 0.00279527, -0.00289068,
-0.0002257 , -0.00423192, 0.00389712, 0.00435556, -0.00169954,
0.00372895, -0.00204547, 0.00031502, -0.000558 , 0.00302919,
-0.00329071, -0.00445723, -0.00090491, -0.00073062, 0.0055988 ,
-0.00180256, -0.00032414, 0.00129002, -0.00077443, -0.00511642,
0.00082947, 0.00207873, 0.00064075, 0.00320432, 0.00252466,
0.00165025, 0.00274325, 0.00557919, 0.00422611, -0.00348751,
0.00488516, 0.00238723, -0.0034958 , 0.00119023, -0.0009317 ,
-0.00051728, -0.00448227, -0.00145251, 0.00098566, -0.00352017,
-0.00017685, 0.00388307, 0.00305843, -0.00614224, 0.00319819,
-0.0038121 , 0.00025529, -0.00525783, -0.00364403, 0.00531866,
-0.00040134, 0.00509736, -0.00279795, -0.00520586, 0.00088609,
-0.00209225, 0.00341286, 0.00403736, -0.00360165, -0.0025662 ,
-0.00442059, -0.00286324, -0.00441705, -0.00248354, -0.00311305,
-0.0017566 , -0.00094437, 0.00274204, -0.00165205, -0.00576063,
-0.00485601, 0.00124242, -0.00536304, -0.00042135, -0.00091981,
0.00247618, -0.0041995 , -0.00123214, -0.00465355, 0.00184935],
dtype=float32)

View words that are most similar to ‘python’:

model.wv.most_similar('python')

[('revision', 0.3588007390499115),
('system', 0.35286271572113037),
('feature', 0.33182504773139954),
('integer', 0.3286311626434326),
('systems', 0.3157789707183838),
('x', 0.3120567798614502),
('numbers', 0.30488497018814087),
('supports', 0.29797300696372986),
('language', 0.28043025732040405),
('major', 0.2673966884613037)]

Some other resources: stackabuse, skymind and google code



