text-extraction

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

web-scraping text-extraction natural-language-processing text-mining crawler text-preprocessing article-extractor readability scraping html-to-markdown corpus-tools rss-feed news-aggregator rag large-language-models

Python

4763

318

22 days ago

miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

Python lsa textteaser html-page summarizer pagerank-algorithm reduction text-extraction html-extraction html-extractor summarization summary natural-language-processing

Python

3629

537

1 month ago

unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)

Go pdf pdf-library pdf-generation pdf-document-processor text-extraction pdf-manipulation signing pdf-sign pdf-generator

2919

274

1 month ago

Goldziher / kreuzberg

Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

OCR text-extraction async document-intelligence mcp pandoc Python rag table-extraction tesseract

HTML

2417

102

1 day ago

chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Python Parsing text-extraction mime buffer memex text-recognition detection recognition natural-language-processing nlp-library COVID-19 extraction

Python

1625

244

6 months ago

whitelok / image-text-localization-recognition

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.

text-recognition text-detection convolutional-neural-networks deep-learning OCR text-extraction machine-learning Awesome Lists

955

233

2 years ago

miso-belica / jusText

Heuristic based boilerplate removal tool

Python text-extraction html-parser html-parsing

Python

796

7 months ago

unidoc / unidoc

This repository has moved! https://github.com/unidoc/unipdf

Go pdf pdf-library pdf-files text-extraction pdf-invoice

709

6 years ago

ICIJ / datashare

A self‑hosted search engine for documents

named-entity-recognition text-extraction extract investigative-journalism elasticsearch Docker web-gui

Java

659

3 days ago

ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents

text-extraction R rstats pdf-files r-package

C++

536

1 month ago

cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

srt subtitle subtitles text-extraction Python mit-license tools command-line-interface command-line-tool Library

Python

523

2 years ago

iamarunbrahma / vision-parse

Parse PDFs into markdown using Vision LLMs

document-parser pdf-parser pdf-to-markdown text-extraction

Python

429

4 hours ago

flairNLP / fundus

A very simple news crawler with a funny name

corpus crawler natural-language-processing Python RSS scraper sitemap text-extraction web-scraping corpus-tools datasets image-classification

Python

406

5 days ago

shixzie / nlp

[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

natural-language-processing Parsing Go text-extraction text

389

8 years ago

pd3f / pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

pdf text-extraction pdf-to-text pipeline machine-learning OCR language-model extract-text parsr Python

HTML

327

2 years ago

py-pdf / benchmarks

Benchmarking PDF libraries

benchmark data-extraction mupdf pdf pypdf2 text-extraction

Python

312

3 months ago

Goldziher / html-to-markdown

HTML to markdown converter

html-converter markdown-converter rag text-extraction text-processing

Python

260

1 day ago

bookieio / breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

Python text-mining text-extraction html-extraction html-extractor html-parsing

HTML

205

1 year ago

weareprestatech / hotpdf

hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six

pdf Python text-extraction text-search

Python

196

10 months ago

SapienzaNLP / extend

Entity Disambiguation as text extraction (ACL 2022)

natural-language-processing Entity resolution text-extraction PyTorch acl

Python

182

3 years ago