10-Transformers

AI dev
Published

March 15, 2025

Let’s create a transformer trained on Shakespeare’s works:

Get training text

with open('../../resources/data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# get used chars
chars = sorted(list(set(text)))
vocab_size = len(chars)

print("characters used by Shakespeare: "+''.join(chars))
print("this corresponds to: "+str(vocab_size)+" characters")
characters used by Shakespeare: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
this corresponds to: 65 characters

Tokenize

Here we tokenize by character. tiktoken (the tokenizer library used by GPT models) uses a subword tokenizer instead of a character-level one, which gives GPT-2 a vocabulary of 50,257 tokens, compared with the 65 characters we have here.

stoi = { ch:i for i,ch in enumerate(chars) } # map each char to its index in chars
itos = { i:ch for i,ch in enumerate(chars) } # the opposite: map each index back to its char

encode = lambda s: [stoi[c] for c in s] # encoder: lambda function to encode chars in a string
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: lambda function to decode values in a list

test_text="hello world"
encoded=encode(test_text)
print(encoded)
print(decode(encoded))
[46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]
hello world
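As a quick check of the character-level scheme, here is a minimal sketch on a toy vocabulary built from a single sample string instead of the Shakespeare file. The `toy_*` names are illustrative and not part of the tutorial code. It shows that encoding followed by decoding returns the original string, and that a character missing from the vocabulary cannot be encoded.

```python
# Toy vocabulary built from one sample string (assumption: stands in for the full text)
toy_text = "hello world"
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for ch, i in toy_stoi.items()}

def toy_encode(s):
    """Map each character of s to its integer id."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(toy_itos[i] for i in ids)

toy_ids = toy_encode(toy_text)
print(toy_ids)
print(toy_decode(toy_ids) == toy_text)  # round trip gives back the input

# A character that never appeared in the source text has no id
try:
    toy_encode("hello World")  # capital 'W' is not in the toy vocabulary
except KeyError as err:
    print("unknown character:", err)
```

With a character-level tokenizer the vocabulary is fixed by the training text, so any new character would need the vocabulary to be rebuilt.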

Now we can tokenize the whole Shakespeare text. We wrap the tokenized data in a torch tensor so that later computations run faster and can run on either the CPU or the GPU.

import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])
torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])
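To make the CPU/GPU remark concrete, here is a minimal sketch of moving a tensor to whichever device is available. The `device` and `sample` names are illustrative and not part of the tutorial code above; on a machine without a CUDA GPU the tensor simply stays on the CPU.

```python
import torch

# Pick the GPU when PyTorch can see one, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Small stand-in for the tokenized data (first five ids from the output above)
sample = torch.tensor([18, 47, 56, 57, 58], dtype=torch.long)
sample = sample.to(device)  # returns a tensor that lives on the chosen device

print(sample.device)
```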

To train a model we need training data and validation data. Here we take the first 90% of the dataset as training data and the remaining 10% as validation data.

n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
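The arithmetic of the split can be checked on its own. This small sketch assumes the dataset length of 1115394 tokens reported by `data.shape` above; the `total_tokens`, `n_train` and `n_val` names are illustrative only.

```python
total_tokens = 1_115_394  # length reported by data.shape above

n_train = int(0.9 * total_tokens)  # int() truncates, so any remainder goes to validation
n_val = total_tokens - n_train

print(n_train, n_val)  # 1003854 111540
```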

When training the transformer we don’t feed it the whole text at once (that would take far too much computing power to be practical). Instead we train on chunks of at most block_size tokens. Common block sizes are 128, 256, 512, or 1024 tokens, but we’ll use 8 for the example so it doesn’t take too many resources.

For a chunk like (18,47,56,57,58,1,15,47,58), this teaches the transformer that after 18 likely comes a 47, after 18,47 likely comes a 56, after 18,47,56 likely comes a 57, and so on. A chunk of block_size+1 = 9 tokens therefore contains 8 training examples.

block_size = 8
train_data[:block_size+1]

x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")
when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58
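The loop above always looks at the first chunk of the training data. A minimal sketch of drawing a chunk from a random position instead is given below; `get_chunk`, `toy_data`, `chunk_x` and `chunk_y` are hypothetical names, and `toy_data` is a stand-in for `train_data` so the snippet runs on its own.

```python
import torch

torch.manual_seed(0)  # only to make the sketch repeatable

block_size = 8
toy_data = torch.arange(100)  # stand-in for train_data (assumption)

def get_chunk(data, block_size):
    """Return (x, y): a random window of block_size tokens and the same window shifted by one."""
    # The start index must leave room for block_size + 1 tokens
    start = torch.randint(len(data) - block_size, (1,)).item()
    x = data[start:start + block_size]
    y = data[start + 1:start + block_size + 1]
    return x, y

chunk_x, chunk_y = get_chunk(toy_data, block_size)
print(chunk_x)
print(chunk_y)
```

Because `toy_data` is just the integers 0 to 99, each target in `chunk_y` is the corresponding input in `chunk_x` plus one, which makes the one-token shift easy to see.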

To be continued: the tutorial resumes at 18:26.

References

Followed tutorial:
https://www.youtube.com/watch?v=kCc8FmEb1nY&t=848s&ab_channel=AndrejKarpathy
https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing#scrollTo=h5hjCcLDr2WC

Other sources:
https://github.com/karpathy/nanoGPT
https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt