Language Fashions for Sentence Completion | by Dhruv Matani | Sep, 2023

Let’s construct a easy LSTM mannequin and practice it to foretell the subsequent token given a prefix of tokens. Now, you would possibly ask what a token is.


Usually for language fashions, a token can imply

  1. A single character (or a single byte)
  2. A complete phrase within the goal language
  3. One thing in between 1 and a pair of. That is often known as a sub-word

Mapping a single character (or byte) to a token may be very restrictive since we’re overloading that token to carry a variety of context about the place it happens. It is because the character “c” for instance, happens in many various phrases, and to foretell the subsequent character after we see the character “c” requires us to actually look onerous on the main context.

Mapping a single phrase to a token can also be problematic since English itself has anyplace between 250k and 1 million phrases. As well as, what occurs when a brand new phrase is added to the language? Do we have to return and re-train your complete mannequin to account for this new phrase?

Sub-word tokenization is taken into account the trade commonplace within the yr 2023. It assigns substrings of bytes steadily occurring collectively to distinctive tokens. Usually, language fashions have anyplace from a couple of thousand (say 4,000) to tens of 1000’s (say 60,000) of distinctive tokens. The algorithm to find out what constitutes a token is set by the BPE (Byte pair encoding) algorithm.

To decide on the variety of distinctive tokens in our vocabulary (known as the vocabulary measurement), we have to be conscious of some issues:

  1. If we select too few tokens, we’re again within the regime of a token per character, and it’s onerous for the mannequin to study something helpful.
  2. If we select too many tokens, we find yourself in a scenario the place the mannequin’s embedding tables over-shadow the remainder of the mannequin’s weight and it turns into onerous to deploy the mannequin in a constrained setting. The scale of the embedding desk will depend upon the variety of dimensions we use for every token. It’s not unusual to make use of a measurement of 256, 512, 786, and so on… If we use a token embedding dimension of 512, and we’ve got 100k tokens, we find yourself with an embedding desk that makes use of 200MiB in reminiscence.

Therefore, we have to strike a stability when selecting the vocabulary measurement. On this instance, we choose 6600 tokens and practice our tokenizer with a vocabulary measurement of 6600. Subsequent, let’s check out the mannequin definition itself.

The PyTorch Mannequin

The mannequin itself is fairly easy. Now we have the next layers:

  1. Token Embedding (vocab measurement=6600, embedding dim=512), for a complete measurement of about 15MiB (assuming 4 byte float32 because the embedding desk’s knowledge sort)
  2. LSTM (num layers=1, hidden dimension=786) for a complete measurement of about 16MiB
  3. Multi-Layer Perceptron (786 to 3144 to 6600 dimensions) for a complete measurement of about 93MiB

The entire mannequin has about 31M trainable parameters for a complete measurement of about 120MiB.

Right here’s the PyTorch code for the mannequin.

class WordPredictionLSTMModel(nn.Module):
def __init__(self, num_embed, embed_dim, pad_idx, lstm_hidden_dim, lstm_num_layers, output_dim, dropout):
self.vocab_size = num_embed
self.embed = nn.Embedding(num_embed, embed_dim, pad_idx)
self.lstm = nn.LSTM(embed_dim, lstm_hidden_dim, lstm_num_layers, batch_first=True, dropout=dropout)
self.fc = nn.Sequential(
nn.Linear(lstm_hidden_dim, lstm_hidden_dim * 4),
nn.LayerNorm(lstm_hidden_dim * 4),

nn.Linear(lstm_hidden_dim * 4, output_dim),

def ahead(self, x):
x = self.embed(x)
x, _ = self.lstm(x)
x = self.fc(x)
x = x.permute(0, 2, 1)
return x

Right here’s the mannequin abstract utilizing torchinfo.

LSTM Mannequin Abstract

Layer (sort:depth-idx) Param #
WordPredictionLSTMModel -
├─Embedding: 1–1 3,379,200
├─LSTM: 1–2 4,087,200
├─Sequential: 1–3 -
│ └─Linear: 2–1 2,474,328
│ └─LayerNorm: 2–2 6,288
│ └─LeakyReLU: 2–3 -
│ └─Dropout: 2–4 -
│ └─Linear: 2–5 20,757,000
Complete params: 30,704,016
Trainable params: 30,704,016
Non-trainable params: 0

Deciphering the accuracy: After coaching this mannequin on 12M English language sentences for about 8 hours on a P100 GPU, we achieved a lack of 4.03, a top-1 accuracy of 29% and a top-5 accuracy of 49%. Which means that 29% of the time, the mannequin was in a position to appropriately predict the subsequent token, and 49% of the time, the subsequent token within the coaching set was one of many high 5 predictions by the mannequin.

What ought to our success metric be? Whereas the top-1 and top-5 accuracy numbers for our mannequin aren’t spectacular, they aren’t as necessary for our downside. Our candidate phrases are a small set of doable phrases that match the swipe sample. What we would like from our mannequin is to have the ability to choose a great candidate to finish the sentence such that it’s syntactically and semantically coherent. Since our mannequin learns the nature of language by means of the coaching knowledge, we anticipate it to assign a better likelihood to coherent sentences. For instance, if we’ve got the sentence “The baseball participant” and doable completion candidates (“ran”, “swam”, “hid”), then the phrase “ran” is a greater follow-up phrase than the opposite two. So, if our mannequin predicts the phrase ran with a better likelihood than the remaining, it really works for us.

Deciphering the loss: A lack of 4.03 implies that the detrimental log-likelihood of the prediction is 4.03, which implies that the likelihood of predicting the subsequent token appropriately is e^-4.03 = 0.0178 or 1/56. A randomly initialized mannequin usually has a lack of about 8.8 which is -log_e(1/6600), for the reason that mannequin randomly predicts 1/6600 tokens (6600 being the vocabulary measurement). Whereas a lack of 4.03 could not appear nice, it’s necessary to do not forget that the skilled mannequin is about 120x higher than an untrained (or randomly initialized) mannequin.

Subsequent, let’s check out how we are able to use this mannequin to enhance recommendations from our swipe keyboard.

Utilizing the mannequin to prune invalid recommendations

Let’s check out an actual instance. Suppose we’ve got a partial sentence “I feel”, and the consumer makes the swipe sample proven in blue beneath, beginning at “o”, going between the letters “c” and “v”, and ending between the letters “e” and “v”.

Some doable phrases that might be represented by this swipe sample are

  1. Over
  2. Oct (quick for October)
  3. Ice
  4. I’ve (with the apostrophe implied)

Of those recommendations, the most certainly one might be going to be “I’ve”. Let’s feed these recommendations into our mannequin and see what it spits out.

[I think] [I've] = 0.00087
[I think] [over] = 0.00051
[I think] [ice] = 0.00001
[I think] [Oct] = 0.00000

The worth after the = signal is the likelihood of the phrase being a sound completion of the sentence prefix. On this case, we see that the phrase “I’ve” has been assigned the very best likelihood. Therefore, it’s the most certainly phrase to comply with the sentence prefix “I feel”.

The following query you might need is how we are able to compute these next-word possibilities. Let’s have a look.

Computing the subsequent phrase likelihood

To compute the likelihood {that a} phrase is a sound completion of a sentence prefix, we run the mannequin in eval (inference) mode and feed within the tokenized sentence prefix. We additionally tokenize the phrase after including a whitespace prefix to the phrase. That is completed as a result of the HuggingFace pre-tokenizer splits phrases with areas originally of the phrase, so we need to guarantee that our inputs are in step with the tokenization technique utilized by HuggingFace Tokenizers.

Let’s assume that the candidate phrase is made up of three tokens T0, T1, and T2.

  1. We first run the mannequin with the unique tokenized sentence prefix. For the final token, we examine the likelihood of predicting token T0. We add this to the “probs” listing.
  2. Subsequent, we run a prediction on the prefix+T0 and examine the likelihood of token T1. We add this likelihood to the “probs” listing.
  3. Subsequent, we run a prediction on the prefix+T0+T1 and examine the likelihood of token T2. We add this likelihood to the “probs” listing.

The “probs” listing comprises the person possibilities of producing the tokens T0, T1, and T2 in sequence. Since these tokens correspond to the tokenization of the candidate phrase, we are able to multiply these possibilities to get the mixed likelihood of the candidate being a completion of the sentence prefix.

The code for computing the completion possibilities is proven beneath.

 def get_completion_probability(self, enter, completion, tok):
ids = tok.encode(enter).ids
ids = torch.tensor(ids, system=self.system).unsqueeze(0)
completion_ids = torch.tensor(tok.encode(completion).ids, system=self.system).unsqueeze(0)
probs = []
for i in vary(completion_ids.measurement(1)):
y = self.mannequin(ids)
y = y[0,:,-1].softmax(dim=0)
# prob is the likelihood of this completion.
prob = y[completion_ids[0,i]]
ids =[ids, completion_ids[:,i:i+1]], dim=1)
return torch.tensor(probs)

We will see some extra examples beneath.

[That ice-cream looks] [really] = 0.00709
[That ice-cream looks] [delicious] = 0.00264
[That ice-cream looks] [absolutely] = 0.00122
[That ice-cream looks] [real] = 0.00031
[That ice-cream looks] [fish] = 0.00004
[That ice-cream looks] [paper] = 0.00001
[That ice-cream looks] [atrocious] = 0.00000[Since we're heading] [toward] = 0.01052
[Since we're heading] [away] = 0.00344
[Since we're heading] [against] = 0.00035
[Since we're heading] [both] = 0.00009
[Since we're heading] [death] = 0.00000
[Since we're heading] [bubble] = 0.00000
[Since we're heading] [birth] = 0.00000

[Did I make] [a] = 0.22704
[Did I make] [the] = 0.06622
[Did I make] [good] = 0.00190
[Did I make] [food] = 0.00020
[Did I make] [color] = 0.00007
[Did I make] [house] = 0.00006
[Did I make] [colour] = 0.00002
[Did I make] [pencil] = 0.00001
[Did I make] [flower] = 0.00000

[We want a candidate] [with] = 0.03209
[We want a candidate] [that] = 0.02145
[We want a candidate] [experience] = 0.00097
[We want a candidate] [which] = 0.00094
[We want a candidate] [more] = 0.00010
[We want a candidate] [less] = 0.00007
[We want a candidate] [school] = 0.00003

[This is the definitive guide to the] [the] = 0.00089
[This is the definitive guide to the] [complete] = 0.00047
[This is the definitive guide to the] [sentence] = 0.00006
[This is the definitive guide to the] [rapper] = 0.00001
[This is the definitive guide to the] [illustrated] = 0.00001
[This is the definitive guide to the] [extravagant] = 0.00000
[This is the definitive guide to the] [wrapper] = 0.00000
[This is the definitive guide to the] [miniscule] = 0.00000

[Please can you] [check] = 0.00502
[Please can you] [confirm] = 0.00488
[Please can you] [cease] = 0.00002
[Please can you] [cradle] = 0.00000
[Please can you] [laptop] = 0.00000
[Please can you] [envelope] = 0.00000
[Please can you] [options] = 0.00000
[Please can you] [cordon] = 0.00000
[Please can you] [corolla] = 0.00000

[I think] [I've] = 0.00087
[I think] [over] = 0.00051
[I think] [ice] = 0.00001
[I think] [Oct] = 0.00000

[Please] [can] = 0.00428
[Please] [cab] = 0.00000

[I've scheduled this] [meeting] = 0.00077
[I've scheduled this] [messing] = 0.00000

These examples present the likelihood of the phrase finishing the sentence earlier than it. The candidates are sorted in reducing order of likelihood.

Since Transformers are slowly changing LSTM and RNN fashions for sequence-based duties, let’s check out what a Transformer mannequin for a similar goal would appear like.

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button