Parameter Redundancy in Large Language Models

LLMs are getting larger and larger these days. While larger models generally perform better, there are many scenarios where a smaller model can perform just as well. If your goal is to train and run LLMs for specific tasks, there are likely redundancies in large models. This means you don’t always have to use a larger model to get better performance, especially when the budget is limited.

Here are a few scenarios where it is possible to train and run smaller models and still achieve results that are as good as those of bigger models.

Downsizing the model through quantization for your task.

If the model is trained and stored in 32-bit floating point, you can try mapping it to 16-bit floating point and then evaluate whether it still achieves the performance you require.
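To make this concrete, here is a minimal PyTorch sketch of the idea, using a toy two-layer network as a stand-in for a much larger model; the layer sizes are made up purely for illustration:

import torch.nn as nn

def param_megabytes(m: nn.Module) -> float:
    # Total parameter storage in megabytes.
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 1e6

# A toy network standing in for a much larger model.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
print("fp32:", param_megabytes(model), "MB")

# Cast every parameter from 32-bit to 16-bit floating point in place.
model.half()
print("fp16:", param_megabytes(model), "MB")  # roughly half the storage

After the cast, you would rerun your task-specific evaluation to confirm that any accuracy drop is acceptable.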

Pruning parameters.

Based on the Lottery Ticket Hypothesis, if you choose the right subset of parameters and initialize them correctly, you can achieve the same level of performance as the fully trained model.
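As a simple starting point (not the full iterative lottery-ticket procedure of train, prune, rewind to the initial weights, and retrain), PyTorch’s built-in pruning utilities let you zero out low-magnitude weights and measure the resulting sparsity. The layer below is a hypothetical stand-in for one layer of a larger model:

import torch.nn as nn
import torch.nn.utils.prune as prune

# A single linear layer standing in for one layer of a larger model.
layer = nn.Linear(512, 512)

# Zero out the 30% of weights with the smallest absolute value (magnitude pruning).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Fold the pruning mask into the weight tensor permanently.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Fraction of zeroed weights: {sparsity:.0%}")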

Model Distillation.

You can train a smaller model to mimic the behavior of a bigger model.
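A common way to do this is to train the student on a mix of the ground-truth labels and the teacher’s softened output distribution. Here is a sketch of such a distillation loss in PyTorch; the temperature T, the weight alpha, and the toy shapes are illustrative choices, not fixed values:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft part: the student matches the teacher's softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard part: the student still learns from the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy batch of 4 examples with 10 classes, just to show the call.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()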

Training through LoRA or QLoRA.

When retraining a model, even when updating parameters in earlier layers, you don’t have to retrain all the parameters, which is expensive and may also lead to catastrophic forgetting. You can instead freeze the original weights and train a set of delta weights. The delta weights can be kept much smaller through low-rank decomposition (in the spirit of SVD): what you train is no longer a matrix the size of the original parameters but two much smaller factor matrices.
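To see why the trainable part is so much smaller, here is a minimal, self-contained sketch of a LoRA-style layer in PyTorch (not the actual implementation from any library); the rank r, scaling alpha, and layer size are illustrative:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight plus a trainable low-rank delta B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # delta starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # Original output plus the low-rank update; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable:,} of {total:,}")  # ~65K of ~16.8M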

Model training with domain-specific data.

If you have high-quality domain-specific data, you can train a model that is much smaller in size but still delivers performance comparable to larger models. For example, BloombergGPT, with 50 billion parameters, achieves similar performance to much bigger models and even outperforms them on certain tasks. This shows that the number of parameters is not the only factor that matters. Note that training a model like BloombergGPT still takes a significant amount of resources and cost.

Embeddings — everything can be a vector

To computers, everything is numeric. Any object can be a vector for computers to process. Here I mean any: an image, a piece of music, a piece of text, anything. Imagine that every human individual could also be represented by a (super long) vector, with all of their biological and social information encoded in a series of numbers…

This is the same for word tokens in Natural Language Processing. Since earlier models such as Google’s Word2Vec, there have been methods to represent word tokens as vectors that carry information about word meanings. The term “word token” is used loosely here. Depending on how you train your tokenizer, it can be at the word level, but also at other levels of granularity: byte level, subword level, multi-word phrase level, sentence level, and even document level.

Every embedding method has an objective. For word tokens, the objective is to encode similarity and distance between tokens, and this determines how the embeddings are trained. Word embeddings are trained from word contexts. In Word2Vec [T. Mikolov et al.], for example, this is done through Continuous Bag of Words (CBOW) and Skip-gram training. This ensures that words appearing in similar contexts end up closer to each other in the vector space.
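If you want to see how these two objectives are selected in practice, here is a tiny, hypothetical gensim training run; the toy corpus is far too small to produce meaningful embeddings and is only there to show the API:

from gensim.models import Word2Vec

# A tiny toy corpus; real training needs vastly more text.
sentences = [
    ["the", "dog", "barks", "at", "the", "cat"],
    ["the", "cat", "chases", "the", "dog"],
    ["a", "puppy", "is", "a", "young", "dog"],
]

# sg=1 selects Skip-gram; sg=0 (the default) selects CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("dog", topn=3))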

Rather than training from scratch, let’s examine some real examples. We can use the gensim library to download pretrained embeddings.

import gensim.downloader

# Download the pretrained "word2vec-google-news-300" embeddings.
W2V_vectors = gensim.downloader.load('word2vec-google-news-300')

[===============================] 100.0% 1662.8/1662.8MB downloaded

Then we can query the embedding vector by using word tokens as keys.

# Use the downloaded vectors as usual:

dog_vector = W2V_vectors['dog']
print("The embedding vector:\n", dog_vector)
print("The shape of the vector is ", dog_vector.shape)The embedding vector:

[ 5.12695312e-02 -2.23388672e-02 -1.72851562e-01 1.61132812e-01
-8.44726562e-02 5.73730469e-02 5.85937500e-02 -8.25195312e-02
-1.53808594e-02 -6.34765625e-02 1.79687500e-01 -4.23828125e-01
-2.25830078e-02 -1.66015625e-01 -2.51464844e-02 1.07421875e-01
-1.99218750e-01 1.59179688e-01 -1.87500000e-01 -1.20117188e-01
1.55273438e-01 -9.91210938e-02 1.42578125e-01 -1.64062500e-01
-8.93554688e-02 2.00195312e-01 -1.49414062e-01 3.20312500e-01
3.28125000e-01 2.44140625e-02 -9.71679688e-02 -8.20312500e-02
-3.63769531e-02 -8.59375000e-02 -9.86328125e-02 7.78198242e-03
-1.34277344e-02 5.27343750e-02 1.48437500e-01 3.33984375e-01
1.66015625e-02 -2.12890625e-01 -1.50756836e-02 5.24902344e-02
-1.07421875e-01 -8.88671875e-02 2.49023438e-01 -7.03125000e-02
-1.59912109e-02 7.56835938e-02 -7.03125000e-02 1.19140625e-01
2.29492188e-01 1.41601562e-02 1.15234375e-01 7.50732422e-03
2.75390625e-01 -2.44140625e-01 2.96875000e-01 3.49121094e-02
2.42187500e-01 1.35742188e-01 1.42578125e-01 1.75781250e-02
2.92968750e-02 -1.21582031e-01 2.28271484e-02 -4.76074219e-02
-1.55273438e-01 3.14331055e-03 3.45703125e-01 1.22558594e-01
-1.95312500e-01 8.10546875e-02 -6.83593750e-02 -1.47094727e-02
2.14843750e-01 -1.21093750e-01 1.57226562e-01 -2.07031250e-01
1.36718750e-01 -1.29882812e-01 5.29785156e-02 -2.71484375e-01
-2.98828125e-01 -1.84570312e-01 -2.29492188e-01 1.19140625e-01
1.53198242e-02 -2.61718750e-01 -1.23046875e-01 -1.86767578e-02
-6.49414062e-02 -8.15429688e-02 7.86132812e-02 -3.53515625e-01
5.24902344e-02 -2.45361328e-02 -5.43212891e-03 -2.08984375e-01
-2.10937500e-01 -1.79687500e-01 2.42187500e-01 2.57812500e-01
1.37695312e-01 -2.10937500e-01 -2.17285156e-02 -1.38671875e-01
1.84326172e-02 -1.23901367e-02 -1.59179688e-01 1.61132812e-01
2.08007812e-01 1.03027344e-01 9.81445312e-02 -6.83593750e-02
-8.72802734e-03 -2.89062500e-01 -2.14843750e-01 -1.14257812e-01
-2.21679688e-01 4.12597656e-02 -3.12500000e-01 -5.59082031e-02
-9.76562500e-02 5.81054688e-02 -4.05273438e-02 -1.73828125e-01
1.64062500e-01 -2.53906250e-01 -1.54296875e-01 -2.31933594e-02
-2.38281250e-01 2.07519531e-02 -2.73437500e-01 3.90625000e-03
1.13769531e-01 -1.73828125e-01 2.57812500e-01 2.35351562e-01
5.22460938e-02 6.83593750e-02 -1.75781250e-01 1.60156250e-01
-5.98907471e-04 5.98144531e-02 -2.11914062e-01 -5.54199219e-02
-7.51953125e-02 -3.06640625e-01 4.27734375e-01 5.32226562e-02
-2.08984375e-01 -5.71289062e-02 -2.09960938e-01 3.29589844e-02
1.05468750e-01 -1.50390625e-01 -9.37500000e-02 1.16699219e-01
6.44531250e-02 2.80761719e-02 2.41210938e-01 -1.25976562e-01
-1.00585938e-01 -1.22680664e-02 -3.26156616e-04 1.58691406e-02
1.27929688e-01 -3.32031250e-02 4.07714844e-02 -1.31835938e-01
9.81445312e-02 1.74804688e-01 -2.36328125e-01 5.17578125e-02
1.83593750e-01 2.42919922e-02 -4.31640625e-01 2.46093750e-01
-3.03955078e-02 -2.47802734e-02 -1.17187500e-01 1.61132812e-01
-5.71289062e-02 1.16577148e-02 2.81250000e-01 4.27734375e-01
4.56542969e-02 1.01074219e-01 -3.95507812e-02 1.77001953e-02
-8.98437500e-02 1.35742188e-01 2.08007812e-01 1.88476562e-01
-1.52343750e-01 -2.37304688e-01 -1.90429688e-01 7.12890625e-02
-2.46093750e-01 -2.61718750e-01 -2.34375000e-01 -1.45507812e-01
-1.17187500e-02 -1.50390625e-01 -1.13281250e-01 1.82617188e-01
2.63671875e-01 -1.37695312e-01 -4.58984375e-01 -4.68750000e-02
-1.26953125e-01 -4.22363281e-02 -1.66992188e-01 1.26953125e-01
2.59765625e-01 -2.44140625e-01 -2.19726562e-01 -8.69140625e-02
1.59179688e-01 -3.78417969e-02 8.97216797e-03 -2.77343750e-01
-1.04980469e-01 -1.75781250e-01 2.28515625e-01 -2.70996094e-02
2.85156250e-01 -2.73437500e-01 1.61132812e-02 5.90820312e-02
-2.39257812e-01 1.77734375e-01 -1.34765625e-01 1.38671875e-01
3.53515625e-01 1.22070312e-01 1.43554688e-01 9.22851562e-02
2.29492188e-01 -3.00781250e-01 -4.88281250e-02 -1.79687500e-01
2.96875000e-01 1.75781250e-01 4.80957031e-02 -3.38745117e-03
7.91015625e-02 -2.38281250e-01 -2.31445312e-01 1.66015625e-01
-2.13867188e-01 -7.03125000e-02 -7.56835938e-02 1.96289062e-01
-1.29882812e-01 -1.05957031e-01 -3.53515625e-01 -1.16699219e-01
-5.10253906e-02 3.39355469e-02 -1.43554688e-01 -3.90625000e-03
1.73828125e-01 -9.96093750e-02 -1.66015625e-01 -8.54492188e-02
-3.82812500e-01 5.90820312e-02 -6.22558594e-02 8.83789062e-02
-8.88671875e-02 3.28125000e-01 6.83593750e-02 -1.91406250e-01
-8.35418701e-04 1.04003906e-01 1.52343750e-01 -1.53350830e-03
4.16015625e-01 -3.32031250e-02 1.49414062e-01 2.42187500e-01
-1.76757812e-01 -4.93164062e-02 -1.24511719e-01 1.25976562e-01
1.74804688e-01 2.81250000e-01 -1.80664062e-01 1.03027344e-01
-2.75390625e-01 2.61718750e-01 2.46093750e-01 -4.71191406e-02
6.25000000e-02 4.16015625e-01 -3.55468750e-01 2.22656250e-01]
The shape of the vector is (300,)

We can also find the top 10 most similar words to “dog”.

# Find most similar words
W2V_vectors.most_similar('dog')

[('dogs', 0.8680489659309387),
('puppy', 0.8106428384780884),
('pit_bull', 0.780396044254303),
('pooch', 0.7627376914024353),
('cat', 0.7609457969665527),
('golden_retriever', 0.7500901818275452),
('German_shepherd', 0.7465174198150635),
('Rottweiler', 0.7437615394592285),
('beagle', 0.7418621778488159),
('pup', 0.740691065788269)]

Then we can do some interesting arithmetic on the embedding vectors to find new words based on distance.

# Get the most similar word based on vector arithmetic.
W2V_vectors.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

[('queen', 0.7118193507194519)]

If the input parameters look a bit confusing, here is an alternative view:

If the following holds:

woman – man = queen – king

then, the following must be true:

woman – man + king = queen

You can see from the above that, on the left side of the equation, “woman” and “king” are the positive terms and “man” is the negative term, and that is how we get “queen” on the right side.
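You can also reproduce the analogy by doing the vector arithmetic yourself. One caveat: gensim’s most_similar works on normalized vectors, so the scores from this unnormalized sketch may differ slightly, and skipping the input words is needed because the raw result is usually closest to “king” itself:

# Manually compute woman - man + king with the raw vectors.
analogy = W2V_vectors['woman'] - W2V_vectors['man'] + W2V_vectors['king']

# Look at the nearest neighbors of the result, skipping the input words.
for word, score in W2V_vectors.similar_by_vector(analogy, topn=5):
    if word not in {'woman', 'man', 'king'}:
        print(word, round(float(score), 3))
        break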