sentencepiece

Fast and customizable text tokenization library with BPE and SentencePiece support

C++
302
4 days ago

🌿 An easy-to-use Japanese text processing tool that makes it possible to switch tokenizers with only small code changes.

Python
242
1 year ago

Train a Chinese vocabulary with BPE in sentencepiece and use it in transformers.

Python
117
2 years ago

Free and open source pre-trained translation models, including Kurdish, Samoan, Xhosa, Lao, Corsican, Cebuano, Galician, Yiddish, Swahili, Russian, Belarusian and Yoruba.

48
2 months ago

Minimal example of using a traced huggingface transformers model with libtorch

C++
35
5 years ago

Go implementation of the SentencePiece tokenizer

Go
28
7 months ago

R package for Byte Pair Encoding / Unigram modelling based on Sentencepiece

C++
25
2 years ago

Extremely simple and understandable GPT2 implementation with minor tweaks

Python
21
5 years ago

Rust binding for the sentencepiece library

Rust
20
1 day ago

Learning BPE embeddings by first learning a segmentation model and then training word2vec

Python
19
2 years ago

Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.

Rust
19
1 month ago

sentencepiece port to webassembly with browser compatibility

TypeScript
13
6 months ago

To investigate various DNN text classifiers including MLP, CNN, RNN, BERT approaches.

Jupyter Notebook
9
5 years ago

Use SentencePiece in Swift for tokenization and detokenization.

Swift
8
2 months ago

sentencepiece C# wrapper

C++
5
6 years ago

Python
4
1 year ago

Bengali language Tokenizer (SentencePiece)

Python
4
5 years ago