#sentencepiece

Fast and customizable text tokenization library with BPE and SentencePiece support

C++
319
6 months ago

🌿 An easy-to-use Japanese text processing tool that lets you switch tokenizers with only small code changes.

Python
255
5 months ago

Train a Chinese vocabulary with BPE in sentencepiece and use it in transformers.

Python
119
2 years ago

Free and open source pre-trained translation models, including Kurdish, Samoan, Xhosa, Lao, Corsican, Cebuano, Galician, Russian, Belarusian and Yoruba.

87
2 months ago

Minimal example of using a traced Hugging Face transformers model with libtorch

C++
36
5 years ago

Go implementation of the SentencePiece tokenizer

Go
35
1 year ago

Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and more. Supports BPE, Unigram and WordPiece tokenization in JavaScript, Python and Rust.

Rust
34
6 months ago

R package for Byte Pair Encoding / Unigram modelling based on SentencePiece

C++
25
3 years ago

Rust binding for the sentencepiece library

Rust
22
1 month ago

Extremely simple and understandable GPT2 implementation with minor tweaks

Python
21
6 years ago

Learning BPE embeddings by first learning a segmentation model and then training word2vec

Python
19
3 years ago

Use SentencePiece in Swift for tokenization and detokenization.

Swift
15
3 months ago

SentencePiece port to WebAssembly with browser compatibility

TypeScript
13
1 year ago

Decoder-only model trained on the large BookCorpus dataset. First time!

Jupyter Notebook
11
1 year ago

An investigation of various DNN text classifiers, including MLP, CNN, RNN, and BERT approaches.

Jupyter Notebook
9
6 years ago

sentencepiece C# wrapper

C++
6
6 years ago
Python
4
1 year ago