
Parsing

Wikipedia

Related topics

ANTLR, LR parser

A grammar describes the syntax of a programming language and is often defined in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes those tokens and builds a data structure such as an abstract syntax tree (AST). The parser is concerned with context: does the sequence of tokens fit the grammar? A compiler typically pairs a lexer and parser, built for a specific grammar, with later stages such as semantic analysis and code generation.
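The lexer-to-parser pipeline described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the arithmetic grammar, the `lex` function, and the `Parser` class are hypothetical names chosen for the example and do not come from any of the repositories listed here. The grammar, in EBNF-style notation, is `expr ::= term (("+" | "-") term)*`, `term ::= factor (("*" | "/") factor)*`, `factor ::= NUM | "(" expr ")"`.

```python
import re
from typing import List, Tuple, Union

Token = Tuple[str, str]                      # (kind, text)
AST = Union[float, Tuple[str, "AST", "AST"]]  # number or (op, left, right)

# Hypothetical token pattern: skip whitespace, then match a number or operator.
TOKEN_RE = re.compile(r"\s*(?:(?P<NUM>\d+(?:\.\d+)?)|(?P<OP>[-+*/()]))")


def lex(text: str) -> List[Token]:
    """Lexical analysis: turn raw text into a list of (kind, text) tokens."""
    tokens: List[Token] = []
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character near offset {pos}")
        kind = m.lastgroup
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens


class Parser:
    """Recursive-descent parser: one method per grammar rule."""

    def __init__(self, tokens: List[Token]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Token:
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("EOF", "")

    def advance(self) -> Token:
        tok = self.peek()
        self.pos += 1
        return tok

    def parse(self) -> AST:
        node = self.expr()
        if self.peek()[0] != "EOF":
            raise SyntaxError(f"unexpected token {self.peek()[1]!r}")
        return node

    def expr(self) -> AST:
        node = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            op = self.advance()[1]
            node = (op, node, self.term())
        return node

    def term(self) -> AST:
        node = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            op = self.advance()[1]
            node = (op, node, self.factor())
        return node

    def factor(self) -> AST:
        kind, text = self.advance()
        if kind == "NUM":
            return float(text)
        if (kind, text) == ("OP", "("):
            node = self.expr()
            if self.advance() != ("OP", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {text!r}")


if __name__ == "__main__":
    print(Parser(lex("1 + 2 * 3")).parse())
```

The split mirrors the two roles: `lex` only decides what the tokens are, while `Parser` decides whether their order fits the grammar and records that structure as nested tuples. A sequence such as `1 +` lexes without error but is rejected by the parser.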

A small library for converting tokenized PHP source code into XML (and potentially other formats)

PHP
5187
1 year ago
TypeScript
2600
12 hours ago
natasha/natasha

Solves basic Russian NLP tasks, API for lower level Natasha projects

Python
1246
6 months ago
dqbd/tiktokenizer

Online playground for OpenAI tokenizers

TypeScript
1117
2 months ago

A Python library for Korean natural language processing. It provides word extraction, tokenization, part-of-speech tagging, and preprocessing.

Python
961
2 months ago
Go
857
19 days ago

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

JavaScript
848
2 years ago

An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation

Python
717
24 days ago

🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.

Swift
688
7 years ago

A SQLite3 FTS5 full-text search tokenizer extension that supports Chinese and Pinyin

C++
670
2 days ago

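Data Annotation is a tool built specifically for processing and annotating text data. Through a streamlined, fast annotation workflow and dynamic algorithmic feedback, it lets users annotate keywords quickly and uses algorithms to keep reducing the cost and time of manual annotation. The process starts with manual annotation to build a foundation, then automatic annotation feeds back into manual annotation, and finally manual annotation corrects any deviations, which greatly improves annotation accuracy and efficiency. Data Annotation relies on an open-source "digital base" platform for managing personnel and roles.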

Java
670
3 months ago

Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and Twitter, 330 million English tweets).

Python
668
1 year ago

An NLP Toolset With A Focus on Explainable Inference

Java
623
4 years ago

Open Korean Text Processor - An Open-source Korean Text Processor

Scala
623
1 year ago

The fast scanner generator for Java™ with full Unicode support

Java
603
4 months ago

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript

Go
576
10 months ago

Implement llama3 inference step by step: grasp the core concepts, master the process derivation, and implement the code.

Jupyter Notebook
569
2 months ago

The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4 / GPT-4o / GPT-o1. Port of OpenAI's tiktoken with additional features.

TypeScript
557
2 months ago

🌿 NodeJS PHP Parser - extract AST or tokens

JavaScript
542
3 days ago