ML Tokenization
- Questions
- Write about KV-cache optimizations
- Can we add a new token and learn it effortlessly?
- Can we build token-less ML model?
- Does tokenization affect multilingual NLP performance (quality and compute)?
- Why might byte-level tokenization be more robust across languages and domains?
- How do special tokens (e.g., [CLS], <s>, </s>, [PAD]) influence model training and attention behavior?
- How does tokenization differ for text, code, and speech data, and why?
- Why do some tokenizers prefer right padding while others use left padding?
- What happens if you fine-tune a model with a tokenizer different from the one used during pretraining?
- How do tokenizers for code-generation models differ from NLP/text-generation tokenizers?
- Why do we need discrete tokenization rather than continuous, character-based, or byte-embedding approaches?
- Is it possible to have self-tokenizers or adaptive tokenizers in model architecture?
Tokenization breaks down text (or other data) into smaller units called tokens (represented as integers) before passing them into a machine learning model. Since machines don’t understand text directly but can work with numbers, the embedding layer converts these token IDs into dense vectors (embeddings) that capture semantic meaning. This foundational preprocessing step is crucial for all modern language models.
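As a minimal sketch of the ID-to-vector step, here is an embedding lookup in PyTorch using BERT-base-like sizes (30,522 tokens, 768-dimensional vectors); the token IDs are taken from the BERT example further below and the layer is untrained, so this only illustrates the shape of the operation:
import torch
import torch.nn as nn

# Toy embedding table: 30,522 token IDs -> 768-dimensional vectors (BERT-base sizes).
embedding = nn.Embedding(num_embeddings=30522, embedding_dim=768)

# A batch of token IDs as produced by a tokenizer.
token_ids = torch.tensor([[101, 19204, 3989, 2003, 102]])

vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([1, 5, 768]): one vector per token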
Tokens are not necessarily complete words, though early systems used word-level tokenization. Modern approaches use subwords (e.g., “play” + “ing”), individual characters (e.g., “p” + “l” + “a” + “y”), or even bytes for multilingual handling and processing non-text data. LLMs like GPT, Claude, and Gemini typically use BPE (Byte Pair Encoding) or SentencePiece tokenizers, which are flexible and eliminate out-of-vocabulary (OOV) issues. The most common tokenization algorithms include BPE, WordPiece, and SentencePiece, with vocabulary sizes typically ranging between 30,000 to 100,000 unique token IDs.
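A toy sketch of the core BPE training loop (repeatedly merge the most frequent adjacent symbol pair); real tokenizers use optimized implementations and byte-level alphabets, but the idea is the same:
from collections import Counter

# Toy corpus: each word is a tuple of symbols with an end-of-word marker, mapped to its frequency.
corpus = Counter({("p", "l", "a", "y", "</w>"): 5,
                  ("p", "l", "a", "y", "i", "n", "g", "</w>"): 3,
                  ("p", "l", "a", "y", "e", "d", "</w>"): 2})

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    merged = Counter()
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] += freq
    return merged

for step in range(5):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print(f"merge {step + 1}: {pair}")
# After a few merges, "play" emerges as a single subword shared by all three word forms.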
Vocabulary size presents important tradeoffs. If the vocabulary is too small, words get split into many pieces, so the model must process longer sequences for the same amount of text, requiring more compute. Semantic understanding also suffers: imagine a word broken into individual characters, where each character carries little meaning on its own. Conversely, if the vocabulary is too large, it fills up with rarely used tokens and near-duplicate variants. For example, you might end up with separate tokens for every number, date, phone number, or address seen in the training data, and related words (tokenize, tokenization, tokenizing, tokenized) can receive unrelated token IDs, leading to poor learning and generalization.
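To see the sequence-length side of this tradeoff, compare how many tokens two of the tokenizers used below produce for the same text (bert-base-uncased with a ~30k vocabulary vs. Qwen/Qwen1.5-1.8B-Chat with ~151k); the exact counts depend on the text, so this is only an illustration:
from transformers import AutoTokenizer

text = "Internationalization and tokenization of multilingual text"

for name in ["bert-base-uncased", "Qwen/Qwen1.5-1.8B-Chat"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text, add_special_tokens=False)
    # Larger vocabularies tend to cover longer subwords, giving shorter sequences.
    print(f"{name}: {len(ids)} tokens -> {tok.convert_ids_to_tokens(ids)}")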
Modern tokenizers include various special tokens that serve specific purposes: [PAD] for padding, [UNK] for unknown or out-of-vocabulary words, [CLS] as the start-of-sequence classification token, [SEP] for separating sentences, and [MASK] for masked positions in masked language modeling. GPT-style models use <bos> and <eos> for beginning and end of sequence. Chat models add system/user/assistant role tokens, while code models add tokens such as fill-in-the-middle markers (e.g., <|fim▁begin|>, <|fim▁hole|>, <|fim▁end|> in the DeepSeek-Coder example below) and dedicated tokens for runs of whitespace, such as Python indentation.
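Related to the question above about adding a new token: a minimal sketch of the usual Hugging Face recipe (the [DOMAIN] token is a made-up example). The new embedding row is randomly initialized, so the token is not learned for free; it needs fine-tuning on data that uses it:
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Register a new special token; the tokenizer maps it to a fresh ID.
num_added = tok.add_special_tokens({"additional_special_tokens": ["[DOMAIN]"]})
print("Added:", num_added, "-> id", tok.convert_tokens_to_ids("[DOMAIN]"))

# Grow the embedding matrix so the new ID has a row; that row is randomly
# initialized and only becomes useful after fine-tuning on data that uses it.
model.resize_token_embeddings(len(tok))
print("New vocab size:", len(tok))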
Padding and truncation strategies vary with model architecture. Right padding is the most common and the default in Hugging Face, used especially for classification, seq2seq, and BERT/T5-style models. Left padding is used for batched autoregressive generation with decoder-only models (GPT-style models, code models, Qwen, LLaMA): the model only “looks left,” and the next token is predicted from the last position of each sequence, so keeping the real tokens at the right edge (and the padding on the left) keeps that position meaningful and works cleanly with the KV cache, as the sketch below shows.
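A small sketch of the difference, using gpt2 as a stand-in decoder-only model (it ships without a pad token, so reusing eos as pad is a common workaround):
from transformers import AutoTokenizer

texts = ["Hello", "Tokenization is cool"]

for side in ["right", "left"]:
    tok = AutoTokenizer.from_pretrained("gpt2", padding_side=side)
    tok.pad_token = tok.eos_token  # gpt2 has no pad token by default
    batch = tok(texts, padding=True, return_tensors="pt")
    print(side, "padding, attention_mask:")
    print(batch["attention_mask"])
# With left padding, the real tokens end each row, so the last position of every
# sequence is an actual token, which is what batched generation samples from.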
- Papers
- Neural Machine Translation of Rare Words with Subword Units
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
- Fast WordPiece Tokenization
- Tokenization Is More Than Compression
- Wikipedia about BPE
- Byte Pair Encoding is Suboptimal for Language Model Pretraining
- Kaggle Intro
- BPE
- todo
- EXAMPLE -
from transformers import AutoTokenizer
# "codellama/CodeLlama-7b-hf"
# "bigcode/starcoder2-3b"
# "deepseek-ai/deepseek-coder-6.7b-base"
model_name = "bert-base-uncased"
# model_name = "deepseek-ai/deepseek-coder-6.7b-base"
tok = AutoTokenizer.from_pretrained(model_name)
print(tok)
print("All special tokens:", tok.all_special_tokens)
print("All special IDs:", tok.all_special_ids)
print("Special token map:", tok.special_tokens_map)
print("Base vocab size:", tok.vocab_size)
print("Total tokens (including added):", len(tok))
text = "Tokenization is Cool! 😎"
# text = """
# import numpy as np;
# import pandas as pd;
# import matplotlib.pyplot as plt;
# import seaborn as sns;
# a = np.zeros((2,10))
# b = 1.001
# """
tokens = tok.tokenize(text)
print("Token texts:", tokens)
token_ids = tok.encode(text, add_special_tokens=True)
print("Token ids:", token_ids)
tokens = tok.convert_ids_to_tokens(token_ids)
print("Token texts from ids:", tokens)
- bert-base-uncased
BertTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)
All special tokens: ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
All special IDs: [100, 102, 0, 101, 103]
Special token map: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
Base vocab size: 30522
Total tokens (including added): 30522
Token texts: ['token', '##ization', 'is', 'cool', '!', '[UNK]']
Token ids: [101, 19204, 3989, 2003, 4658, 999, 100, 102]
Token texts from ids: ['[CLS]', 'token', '##ization', 'is', 'cool', '!', '[UNK]', '[SEP]']
- deepseek-ai/deepseek-coder-6.7b-base
LlamaTokenizerFast(name_or_path='deepseek-ai/deepseek-coder-6.7b-base', vocab_size=32000, model_max_length=16384, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<|begin▁of▁sentence|>', 'eos_token': '<|end▁of▁sentence|>', 'pad_token': '<|end▁of▁sentence|>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
32000: AddedToken("õ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32001: AddedToken("÷", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32002: AddedToken("Á", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32003: AddedToken("ý", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32004: AddedToken("À", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32005: AddedToken("ÿ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32006: AddedToken("ø", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32007: AddedToken("ú", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32008: AddedToken("þ", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32009: AddedToken("ü", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32010: AddedToken("ù", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32011: AddedToken("ö", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32012: AddedToken("û", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32013: AddedToken("<|begin▁of▁sentence|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
32014: AddedToken("<|end▁of▁sentence|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
32015: AddedToken("<|fim▁hole|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32016: AddedToken("<|fim▁begin|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32017: AddedToken("<|fim▁end|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32018: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32019: AddedToken("<|User|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32020: AddedToken("<|Assistant|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
32021: AddedToken("<|EOT|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False),
}
)
All special tokens: ['<|begin▁of▁sentence|>', '<|end▁of▁sentence|>']
All special IDs: [32013, 32014]
Special token map: {'bos_token': '<|begin▁of▁sentence|>', 'eos_token': '<|end▁of▁sentence|>', 'pad_token': '<|end▁of▁sentence|>'}
Base vocab size: 32000
Total tokens (including added): 32022
Token texts: ['Ċ', 'ĠĠĠ', 'Ġimport', 'Ġnum', 'py', 'Ġas', 'Ġnp', ';', 'Ġ', 'Ċ', 'ĠĠĠ', 'Ġimport', 'Ġpand', 'as', 'Ġas', 'Ġp', 'd', ';', 'Ġ', 'Ċ', 'ĠĠĠ', 'Ġimport', 'Ġmat', 'plot', 'lib', '.', 'py', 'plot', 'Ġas', 'Ġpl', 't', ';', 'Ġ', 'Ċ', 'ĠĠĠ', 'Ġimport', 'Ġse', 'ab', 'orn', 'Ġas', 'Ġs', 'ns', ';', 'Ċ', 'Ċ', 'ĠĠĠ', 'Ġa', 'Ġ=', 'Ġnp', '.', 'zer', 'os', '((', '2', ',', '1', '0', '))', 'Ċ', 'ĠĠĠ', 'Ġb', 'Ġ=Ġ', '1', '.', '0', '0', '1', 'Ċ']
Token ids: [32013, 185, 315, 1659, 1181, 4016, 372, 21807, 26, 207, 185, 315, 1659, 21866, 281, 372, 265, 67, 26, 207, 185, 315, 1659, 1575, 13371, 2875, 13, 4016, 13371, 372, 568, 83, 26, 207, 185, 315, 1659, 386, 356, 1745, 372, 252, 3585, 26, 185, 185, 315, 245, 405, 21807, 13, 9888, 378, 5930, 17, 11, 16, 15, 1435, 185, 315, 270, 1412, 16, 13, 15, 15, 16, 185]
Token texts from ids: ['<|begin▁of▁sentence|>', 'Ċ', 'ĠĠĠ', 'Ġimport', 'Ġnum', 'py', 'Ġas', 'Ġnp', ';', 'Ġ', 'Ċ', 'ĠĠĠ', 'Ġimport', 'Ġpand', 'as', 'Ġas', 'Ġp', 'd', ';', 'Ġ', 'Ċ', 'ĠĠĠ', 'Ġimport', 'Ġmat', 'plot', 'lib', '.', 'py', 'plot', 'Ġas', 'Ġpl', 't', ';', 'Ġ', 'Ċ', 'ĠĠĠ', 'Ġimport', 'Ġse', 'ab', 'orn', 'Ġas', 'Ġs', 'ns', ';', 'Ċ', 'Ċ', 'ĠĠĠ', 'Ġa', 'Ġ=', 'Ġnp', '.', 'zer', 'os', '((', '2', ',', '1', '0', '))', 'Ċ', 'ĠĠĠ', 'Ġb', 'Ġ=Ġ', '1', '.', '0', '0', '1', 'Ċ']
- Qwen/Qwen1.5-1.8B-Chat
Qwen2TokenizerFast(name_or_path='Qwen/Qwen1.5-1.8B-Chat', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)
All special tokens: ['<|im_end|>', '<|endoftext|>', '<|im_start|>']
All special IDs: [151645, 151643, 151644]
Special token map: {'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>']}
Base vocab size: 151643
Total tokens (including added): 151646
Token texts: ['Token', 'ization', 'Ġis', 'ĠCool', '!', 'ĠðŁĺ', 'İ']
Token ids: [3323, 2022, 374, 23931, 0, 26525, 236]
Token texts from ids: ['Token', 'ization', 'Ġis', 'ĠCool', '!', 'ĠðŁĺ', 'İ']