bpe

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation

Python
2247
1 year ago

Unsupervised text tokenizer focused on computational efficiency

C++
972
1 year ago

The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT models (o1, o3, o4, gpt-4o, gpt-4, etc.). Port of OpenAI's tiktoken with additional features.

TypeScript
605
2 months ago

Ready-made tokenizer library for working with GPT and tiktoken

Rust
330
10 days ago

Fast and customizable text tokenization library with BPE and SentencePiece support

C++
314
4 months ago

Explains NLP building blocks in a simple manner.

Jupyter Notebook
251
6 years ago

Byte Pair Encoding for Python!

Python
231
3 years ago

Train a language model to chat like you using your personal conversations from WhatsApp, Telegram, Signal, or other platforms.

Jupyter Notebook
187
25 days ago

nfelib - Python bindings for reading and managing XML for NF-e, national NFS-e, CT-e, MDF-e, BP-e

Python
167
14 days ago

Fast bare-bones BPE for modern tokenizer training

Python
164
2 months ago

Go BPE tokenizer (Encoder+Decoder) for GPT2 and GPT3

Go
81
9 months ago

Subword Encoding in Lattice LSTM for Chinese Word Segmentation

Python
54
6 years ago

Simple-to-use scoring function for arbitrarily tokenized texts.

Python
45
6 months ago

Kotlin multiplatform BPE tokenizer library for OpenAI models

Kotlin
36
7 months ago

Low-level implementation of BBPE

Python
31
1 year ago

Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.

Rust
29
5 months ago

High performance unsupervised text tokenization for Ruby

Ruby
21
2 years ago

✂️ OpenAI's tiktoken tokenizer written in Go

Go
20
7 months ago