Parsing

A grammar describes the syntax of a programming language, and might be defined in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes tokens and builds a data structure such as an abstract syntax tree (AST). The parser is concerned with structure: does the sequence of tokens fit the grammar? The front end of a compiler typically combines a lexer and a parser built for a specific grammar.
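
The lexer-then-parser pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not taken from any repository listed here: the toy grammar, the `Token` type, and the `lex` and `parse` functions are all hypothetical names chosen for the example, and the AST is represented as plain nested tuples.

```python
import re
from typing import Iterator, NamedTuple, Optional


class Token(NamedTuple):
    kind: str  # "NUMBER" or "OP"
    text: str


# Hypothetical toy grammar in BNF style (an assumption for this sketch):
#   expr ::= term (("+" | "-") term)*
#   term ::= NUMBER | "(" expr ")"
TOKEN_RE = re.compile(r"\s*(?:(?P<NUMBER>\d+)|(?P<OP>[-+()]))")


def lex(source: str) -> Iterator[Token]:
    """Lexical analysis: turn raw text into a stream of tokens."""
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character near offset {pos}")
        yield Token(match.lastgroup, match.group(match.lastgroup))
        pos = match.end()


def parse(tokens: list) -> tuple:
    """Recursive-descent parser: check tokens against the grammar, build an AST."""
    pos = 0

    def peek() -> Optional[Token]:
        return tokens[pos] if pos < len(tokens) else None

    def advance() -> Token:
        nonlocal pos
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tok

    def term() -> tuple:
        tok = advance()
        if tok.kind == "NUMBER":
            return ("num", int(tok.text))
        if tok.text == "(":
            node = expr()
            if advance().text != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {tok.text!r}")

    def expr() -> tuple:
        node = term()
        # Left-associative chain of "+" and "-" operators.
        while (tok := peek()) is not None and tok.text in ("+", "-"):
            advance()
            node = (tok.text, node, term())
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected trailing token {peek().text!r}")
    return tree


ast = parse(list(lex("1 + (2 - 3)")))
print(ast)
```

The lexer knows nothing about nesting or operator order; it only classifies characters. The parser is where the grammar is enforced, so an input such as `1 +` lexes without complaint but raises `SyntaxError` during parsing.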

theseer / tokenizer

A small library for converting tokenized PHP source code into XML (and potentially other formats)

PHP Parsing XML

PHP

5203

2 years ago

Chevrotain / chevrotain

Parser Building Toolkit for JavaScript

JavaScript TypeScript parser-library Parsing grammars Open Source

TypeScript

2682

217

2 days ago

dqbd / tiktokenizer

Online playground for OpenAI tokenizers

ChatGPT Next openai T3 Stack Parsing

TypeScript

1349

154

5 months ago

roshan-research / hazm

Persian NLP Toolkit

Natural language processing Python persian persian-nlp dependency-parser embeddings text-processing Parsing farsi CSS Resets pos-tagging

Python

1329

198

1 year ago

natasha / natasha

Solves basic Russian NLP tasks, API for lower level Natasha projects

Natural language processing russian Parsing embeddings syntax ner visualization Python

Python

1277

109

1 year ago

lovit / soynlp

A Python library for Korean natural language processing. It provides word extraction, tokenization, part-of-speech tagging, and preprocessing.

Natural language processing Parsing

Python

979

183

5 months ago

ikawaha / kagome

Self-contained Japanese Morphological Analyzer written in pure Go

japanese Parsing nlp-library japanese-language pos-tagging segmentation morphological-analysis korean Hacktoberfest

905

2 days ago

no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

JavaScript Parsing Regular expression

JavaScript

866

2 years ago

wangfenjin / simple

A SQLite3 fts5 full-text search tokenizer extension that supports Chinese and Pinyin

sqlite3 Parsing chinese pinyin SQLite C++

C++

739

102

4 months ago

BLKSerene / Wordless

An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation

corpus corpus-linguistics corpus-tools corpus-processing literature translation Parsing tagger lemmatizer dependency-parser

Python

736

2 days ago

mathewsanders / Mustard

🌭 Mustard is a Swift library for tokenizing strings when splitting by whitespace doesn't cut it.

Swift Parsing

Swift

687

7 years ago

risesoft-y9 / Data-Labeling

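Data-Labeling is a tool dedicated to processing and annotating text data. With a streamlined annotation workflow and dynamic algorithmic feedback, it lets users label keywords quickly while the algorithms steadily reduce the cost and time of manual annotation. Annotation starts with manual labels that build a baseline, automatic labeling then feeds back into manual labeling, and manual labeling finally corrects any errors, which greatly improves annotation accuracy and efficiency. The tool depends on an open-source digital foundation platform to manage personnel and positions.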

chinese Docker elasticsearch Java nacos springboot2 Parsing Vue.js

Java

686

102

3 months ago

cbaziotis / ekphrasis

Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and 330 million English tweets from Twitter).

Natural language processing text-processing nlp-library spelling-correction Parsing tokenization word-segmentation

Python

672

4 months ago

open-korean-text / open-korean-text

Open Korean Text Processor - An Open-source Korean Text Processor

korean Natural language processing text-processing Parsing

Scala

644

2 years ago

niieani / gpt-tokenizer

The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT models (o1, o3, o4, gpt-4o, gpt-4, etc.). Port of OpenAI's tiktoken with additional features.

bpe gpt-2 gpt-3 Machine learning gpt-4 Parsing decoder encoder openai gpt-4o

TypeScript

625

2 months ago