To computers, everything is numeric. Any object can be turned into a vector for computers to process, and I do mean any: an image, a piece of music, a piece of text, anything. Imagine that every human individual could also be represented by a (super long) vector, with all of their biological and social information encoded in a series of numbers…
The same holds for word tokens in Natural Language Processing. Since early models such as Google’s Word2Vec, there have been methods to represent word tokens as vectors that carry information about word meanings. The term “word token” here is used loosely. Depending on how you train your tokenizer, tokens can be at the word level, but also at other levels of granularity: byte level, subword level, multi-word phrase level, sentence level, and even document level.
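As a toy illustration of these granularities (this snippet is mine, not the output of a real tokenizer), consider how one short string can be split at different levels:
sentence = "unbelievable results"
# Word-level tokens.
print(sentence.split())  # ['unbelievable', 'results']
# Byte-level tokens (the raw material byte-level tokenizers start from).
print(list(sentence.encode("utf-8"))[:5])  # [117, 110, 98, 101, 108]
# Subword-level tokens (hand-picked here; a real tokenizer such as
# BPE learns splits like these from data).
print(["un", "believ", "able", "results"])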
Every embedding method has an objective. For word tokens, the objective is to encode similarity and distance between tokens, and this determines how the embeddings are trained. Word embeddings are trained from word contexts. For example, Word2Vec [T. Mikolov et al.] is trained with the Continuous Bag of Words (CBOW) and Skip-gram objectives, which ensure that words appearing in similar contexts end up closer to each other in the embedding space.
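To make this concrete, here is a minimal sketch of training a Skip-gram model with gensim on a made-up toy corpus (the corpus and hyperparameters are purely illustrative):
from gensim.models import Word2Vec
# A toy corpus: each item is a tokenized sentence.
toy_corpus = [
    ["the", "dog", "barks", "at", "the", "cat"],
    ["the", "cat", "chases", "the", "mouse"],
    ["the", "puppy", "barks", "at", "the", "kitten"],
]
# sg=1 selects Skip-gram; sg=0 (the default) selects CBOW.
toy_model = Word2Vec(toy_corpus, vector_size=50, window=2, min_count=1, sg=1)
print(toy_model.wv["dog"].shape)  # (50,)
Pretrained models like the one below are produced the same way, just on a much larger corpus.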
Instead of more theory, let’s examine some real examples. We can use the gensim library to download pretrained embeddings.
import gensim.downloader
# Download the pretrained "word2vec-google-news-300" embeddings.
W2V_vectors = gensim.downloader.load('word2vec-google-news-300')
[===============================] 100.0% 1662.8/1662.8MB downloaded
Then we can query the embedding vector by using word tokens as keys.
# Use the downloaded vectors as usual:
dog_vector = W2V_vectors['dog']
print("The embedding vector:\n", dog_vector)
print("The shape of the vector is ", dog_vector.shape)The embedding vector:
[ 5.12695312e-02 -2.23388672e-02 -1.72851562e-01 1.61132812e-01
-8.44726562e-02 5.73730469e-02 5.85937500e-02 -8.25195312e-02
-1.53808594e-02 -6.34765625e-02 1.79687500e-01 -4.23828125e-01
-2.25830078e-02 -1.66015625e-01 -2.51464844e-02 1.07421875e-01
-1.99218750e-01 1.59179688e-01 -1.87500000e-01 -1.20117188e-01
1.55273438e-01 -9.91210938e-02 1.42578125e-01 -1.64062500e-01
-8.93554688e-02 2.00195312e-01 -1.49414062e-01 3.20312500e-01
3.28125000e-01 2.44140625e-02 -9.71679688e-02 -8.20312500e-02
-3.63769531e-02 -8.59375000e-02 -9.86328125e-02 7.78198242e-03
-1.34277344e-02 5.27343750e-02 1.48437500e-01 3.33984375e-01
1.66015625e-02 -2.12890625e-01 -1.50756836e-02 5.24902344e-02
-1.07421875e-01 -8.88671875e-02 2.49023438e-01 -7.03125000e-02
-1.59912109e-02 7.56835938e-02 -7.03125000e-02 1.19140625e-01
2.29492188e-01 1.41601562e-02 1.15234375e-01 7.50732422e-03
2.75390625e-01 -2.44140625e-01 2.96875000e-01 3.49121094e-02
2.42187500e-01 1.35742188e-01 1.42578125e-01 1.75781250e-02
2.92968750e-02 -1.21582031e-01 2.28271484e-02 -4.76074219e-02
-1.55273438e-01 3.14331055e-03 3.45703125e-01 1.22558594e-01
-1.95312500e-01 8.10546875e-02 -6.83593750e-02 -1.47094727e-02
2.14843750e-01 -1.21093750e-01 1.57226562e-01 -2.07031250e-01
1.36718750e-01 -1.29882812e-01 5.29785156e-02 -2.71484375e-01
-2.98828125e-01 -1.84570312e-01 -2.29492188e-01 1.19140625e-01
1.53198242e-02 -2.61718750e-01 -1.23046875e-01 -1.86767578e-02
-6.49414062e-02 -8.15429688e-02 7.86132812e-02 -3.53515625e-01
5.24902344e-02 -2.45361328e-02 -5.43212891e-03 -2.08984375e-01
-2.10937500e-01 -1.79687500e-01 2.42187500e-01 2.57812500e-01
1.37695312e-01 -2.10937500e-01 -2.17285156e-02 -1.38671875e-01
1.84326172e-02 -1.23901367e-02 -1.59179688e-01 1.61132812e-01
2.08007812e-01 1.03027344e-01 9.81445312e-02 -6.83593750e-02
-8.72802734e-03 -2.89062500e-01 -2.14843750e-01 -1.14257812e-01
-2.21679688e-01 4.12597656e-02 -3.12500000e-01 -5.59082031e-02
-9.76562500e-02 5.81054688e-02 -4.05273438e-02 -1.73828125e-01
1.64062500e-01 -2.53906250e-01 -1.54296875e-01 -2.31933594e-02
-2.38281250e-01 2.07519531e-02 -2.73437500e-01 3.90625000e-03
1.13769531e-01 -1.73828125e-01 2.57812500e-01 2.35351562e-01
5.22460938e-02 6.83593750e-02 -1.75781250e-01 1.60156250e-01
-5.98907471e-04 5.98144531e-02 -2.11914062e-01 -5.54199219e-02
-7.51953125e-02 -3.06640625e-01 4.27734375e-01 5.32226562e-02
-2.08984375e-01 -5.71289062e-02 -2.09960938e-01 3.29589844e-02
1.05468750e-01 -1.50390625e-01 -9.37500000e-02 1.16699219e-01
6.44531250e-02 2.80761719e-02 2.41210938e-01 -1.25976562e-01
-1.00585938e-01 -1.22680664e-02 -3.26156616e-04 1.58691406e-02
1.27929688e-01 -3.32031250e-02 4.07714844e-02 -1.31835938e-01
9.81445312e-02 1.74804688e-01 -2.36328125e-01 5.17578125e-02
1.83593750e-01 2.42919922e-02 -4.31640625e-01 2.46093750e-01
-3.03955078e-02 -2.47802734e-02 -1.17187500e-01 1.61132812e-01
-5.71289062e-02 1.16577148e-02 2.81250000e-01 4.27734375e-01
4.56542969e-02 1.01074219e-01 -3.95507812e-02 1.77001953e-02
-8.98437500e-02 1.35742188e-01 2.08007812e-01 1.88476562e-01
-1.52343750e-01 -2.37304688e-01 -1.90429688e-01 7.12890625e-02
-2.46093750e-01 -2.61718750e-01 -2.34375000e-01 -1.45507812e-01
-1.17187500e-02 -1.50390625e-01 -1.13281250e-01 1.82617188e-01
2.63671875e-01 -1.37695312e-01 -4.58984375e-01 -4.68750000e-02
-1.26953125e-01 -4.22363281e-02 -1.66992188e-01 1.26953125e-01
2.59765625e-01 -2.44140625e-01 -2.19726562e-01 -8.69140625e-02
1.59179688e-01 -3.78417969e-02 8.97216797e-03 -2.77343750e-01
-1.04980469e-01 -1.75781250e-01 2.28515625e-01 -2.70996094e-02
2.85156250e-01 -2.73437500e-01 1.61132812e-02 5.90820312e-02
-2.39257812e-01 1.77734375e-01 -1.34765625e-01 1.38671875e-01
3.53515625e-01 1.22070312e-01 1.43554688e-01 9.22851562e-02
2.29492188e-01 -3.00781250e-01 -4.88281250e-02 -1.79687500e-01
2.96875000e-01 1.75781250e-01 4.80957031e-02 -3.38745117e-03
7.91015625e-02 -2.38281250e-01 -2.31445312e-01 1.66015625e-01
-2.13867188e-01 -7.03125000e-02 -7.56835938e-02 1.96289062e-01
-1.29882812e-01 -1.05957031e-01 -3.53515625e-01 -1.16699219e-01
-5.10253906e-02 3.39355469e-02 -1.43554688e-01 -3.90625000e-03
1.73828125e-01 -9.96093750e-02 -1.66015625e-01 -8.54492188e-02
-3.82812500e-01 5.90820312e-02 -6.22558594e-02 8.83789062e-02
-8.88671875e-02 3.28125000e-01 6.83593750e-02 -1.91406250e-01
-8.35418701e-04 1.04003906e-01 1.52343750e-01 -1.53350830e-03
4.16015625e-01 -3.32031250e-02 1.49414062e-01 2.42187500e-01
-1.76757812e-01 -4.93164062e-02 -1.24511719e-01 1.25976562e-01
1.74804688e-01 2.81250000e-01 -1.80664062e-01 1.03027344e-01
-2.75390625e-01 2.61718750e-01 2.46093750e-01 -4.71191406e-02
6.25000000e-02 4.16015625e-01 -3.55468750e-01 2.22656250e-01]
The shape of the vector is (300,)
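Embedding similarity is usually measured with cosine similarity. As a quick sanity check (my own snippet, not part of the original walkthrough), we can compute it by hand and compare with gensim’s built-in similarity method:
import numpy as np
dog = W2V_vectors['dog']
cat = W2V_vectors['cat']
# Cosine similarity computed by hand.
print(np.dot(dog, cat) / (np.linalg.norm(dog) * np.linalg.norm(cat)))
# gensim's built-in method should return the same value (~0.76).
print(W2V_vectors.similarity('dog', 'cat'))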
Find the top 10 most similar words.
# Find most similar words
W2V_vectors.most_similar('dog')
[('dogs', 0.8680489659309387),
('puppy', 0.8106428384780884),
('pit_bull', 0.780396044254303),
('pooch', 0.7627376914024353),
('cat', 0.7609457969665527),
('golden_retriever', 0.7500901818275452),
('German_shepherd', 0.7465174198150635),
('Rottweiler', 0.7437615394592285),
('beagle', 0.7418621778488159),
('pup', 0.740691065788269)]
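A related KeyedVectors query worth knowing (the word list here is my own illustration) picks out the word least similar to the rest:
# 'banana' should be the odd one out among the animals.
print(W2V_vectors.doesnt_match(['dog', 'cat', 'puppy', 'banana']))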
Then we can do some interesting arithmetic on the vectors to find new words based on distance.
# Getting the most similar words based on vector arithmetic.
W2V_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.7118193507194519)]
If the input parameters look a bit confusing, here is an alternative view:
If the following holds:
woman – man = queen – king
then, the following must be true:
woman – man + king = queen
As you can see, on the left side of the equation, “woman” and “king” are the positive terms and “man” is the negative term, and that’s how we get “queen” on the right side.
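To see that this really is plain vector arithmetic, we can rebuild the result by hand (an illustrative sketch; note that similar_by_vector, unlike most_similar, does not exclude the input words, so ‘king’ itself may rank near the top):
# Build the target vector with raw arithmetic.
target = W2V_vectors['woman'] - W2V_vectors['man'] + W2V_vectors['king']
# Ask for a few neighbors, since the input words are not filtered out.
print(W2V_vectors.similar_by_vector(target, topn=3))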