
#bpe

Unsupervised Word Segmentation for Neural Machine Translation and Text Generation

Python
2230
8 months ago

Unsupervised text tokenizer focused on computational efficiency

C++
965
1 year ago

The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4 / GPT-4o / GPT-o1. Port of OpenAI's tiktoken with additional features.

TypeScript
557
2 months ago

Fast and customizable text tokenization library with BPE and SentencePiece support

C++
302
4 days ago

Ready-made tokenizer library for working with GPT and tiktoken

Rust
301
18 days ago

Explains NLP building blocks in a simple manner.

Jupyter Notebook
251
6 years ago

Byte Pair Encoding for Python!

Python
228
3 years ago

nfelib - Python bindings for reading and managing XML of NF-e, national NFS-e, CT-e, MDF-e, BP-e

Python
157
3 days ago

Fast bare-bones BPE for modern tokenizer training

Python
153
17 days ago

Go BPE tokenizer (Encoder+Decoder) for GPT2 and GPT3

Go
79
5 months ago

Subword Encoding in Lattice LSTM for Chinese Word Segmentation

Python
53
6 years ago

Simple-to-use scoring function for arbitrarily tokenized texts.

Python
39
2 months ago

Kotlin multiplatform BPE tokenizer library for OpenAI models

Kotlin
31
3 months ago

Low-level implementation of BBPE

Python
25
1 year ago

High performance unsupervised text tokenization for Ruby

Ruby
21
1 year ago

✂️ OpenAI's tiktoken tokenizer written in Go

Go
19
3 months ago

Learning BPE embeddings by first learning a segmentation model and then training word2vec

Python
19
2 years ago

Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.

Rust
19
1 month ago