In this tutorial, we'll learn how to create a custom tokenizer using the tiktoken library. The process involves loading a pre-trained tokenizer model, defining both base and special tokens, initializing the tokenizer with a specific regular expression for token splitting, and testing its functionality by encoding and decoding some sample text. This setup is essential for NLP tasks requiring precise control over text tokenization.
from pathlib import Path
import tiktoken
from tiktoken.load import load_tiktoken_bpe
import json
Here, we import several libraries essential for text processing. We use Path from pathlib for easy file path management, while tiktoken and load_tiktoken_bpe facilitate loading and working with a Byte Pair Encoding (BPE) tokenizer.
tokenizer_path = "./content/tokenizer.model"
num_reserved_special_tokens = 256
mergeable_ranks = load_tiktoken_bpe(tokenizer_path)
num_base_tokens = len(mergeable_ranks)
special_tokens = [
"<|begin_of_text|>",
"<|end_of_text|>",
"<|reserved_special_token_0|>",
"<|reserved_special_token_1|>",
"<|finetune_right_pad_id|>",
"<|step_id|>",
"<|start_header_id|>",
"<|end_header_id|>",
"<|eom_id|>",
"<|eot_id|>",
"<|python_tag|>",
]
Here, we set the path to the tokenizer model and specify 256 reserved special tokens. We then load the mergeable ranks, which form the base vocabulary, calculate the number of base tokens, and define a list of special tokens for marking text boundaries and other reserved purposes.
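For context, load_tiktoken_bpe returns a plain dict mapping byte sequences to their integer ranks. A minimal stand-in (a hypothetical byte-level vocabulary rather than the real model file) shows the shape of the data:

```python
# Hypothetical stand-in for load_tiktoken_bpe(tokenizer_path): a byte-level
# base vocabulary where each of the 256 possible bytes is its own token
# (a real model file would also contain ranks for merged byte pairs).
mergeable_ranks = {bytes([i]): i for i in range(256)}

num_base_tokens = len(mergeable_ranks)
print(num_base_tokens)        # 256
print(mergeable_ranks[b"A"])  # 65, the rank of the single byte 'A'
```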
reserved_tokens = [
f"<|reserved_special_token_{2 + i}|>"
for i in range(num_reserved_special_tokens - len(special_tokens))
]
special_tokens = special_tokens + reserved_tokens
tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    special_tokens={token: num_base_tokens + i for i, token in enumerate(special_tokens)},
)
Now, we dynamically create additional reserved tokens to reach 256 in total, then append them to the predefined special tokens list. We then initialize the tokenizer using tiktoken.Encoding, with a specified regular expression for splitting text, the loaded mergeable ranks as the base vocabulary, and a mapping of special tokens to unique token IDs.
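As a quick sanity check on the arithmetic in this step, the list construction can be run standalone, since only the token names and counts matter here (no model file needed):

```python
num_reserved_special_tokens = 256

# The 11 predefined special tokens from the tutorial.
special_tokens = [
    "<|begin_of_text|>", "<|end_of_text|>",
    "<|reserved_special_token_0|>", "<|reserved_special_token_1|>",
    "<|finetune_right_pad_id|>", "<|step_id|>",
    "<|start_header_id|>", "<|end_header_id|>",
    "<|eom_id|>", "<|eot_id|>", "<|python_tag|>",
]

# Fill the remaining slots (256 - 11 = 245) with numbered reserved tokens,
# continuing the numbering from _2 since _0 and _1 are already defined.
reserved_tokens = [
    f"<|reserved_special_token_{2 + i}|>"
    for i in range(num_reserved_special_tokens - len(special_tokens))
]
special_tokens = special_tokens + reserved_tokens

print(len(special_tokens))  # 256
print(special_tokens[-1])   # <|reserved_special_token_246|>
```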
#-------------------------------------------------------------------------
# Test the tokenizer with a sample text
#-------------------------------------------------------------------------
sample_text = "Hello, this is a test of the updated tokenizer!"
encoded = tokenizer.encode(sample_text)
decoded = tokenizer.decode(encoded)
print("Pattern Textual content:", sample_text)
print("Encoded Tokens:", encoded)
print("Decoded Textual content:", decoded)
We test the tokenizer by encoding a sample text into token IDs and then decoding those IDs back into text. It prints the original text, the encoded tokens, and the decoded text to confirm that the tokenizer works correctly.
Similarly, we can encode an individual string such as "Hello" into its corresponding token IDs using the tokenizer's encode method.
In conclusion, this tutorial showed you how to set up a custom BPE tokenizer using the tiktoken library. You saw how to load a pre-trained tokenizer model, define both base and special tokens, and initialize the tokenizer with a specific regular expression for token splitting. Finally, you verified the tokenizer's functionality by encoding and decoding sample text. This setup is a fundamental step for any NLP project that requires customized text processing and tokenization.
Here is the Colab Notebook for the above project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.