text-extraction

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Python
4150
1 month ago
Go
2763
1 month ago

A text extraction library supporting PDFs, images, office documents and more

Python
1776
10 days ago

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Python
1574
6 days ago

A general list of resources for image text localization and recognition (a collection of papers and implementations for scene text localization and recognition)

952
2 years ago

Heuristic-based boilerplate removal tool

Python
766
2 months ago

This repository has moved! https://github.com/unidoc/unipdf

Go
709
6 years ago

Text Extraction, Rendering and Converting of PDF Documents

C++
533
2 months ago

A simple library and set of tools for parsing, modifying, and composing SRT files.

Python
502
1 year ago

[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

Go
388
8 years ago

Parse PDFs into markdown using Vision LLMs

Python
345
2 months ago

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

HTML
314
2 years ago

Reworked version of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

HTML
204
1 year ago

hotpdf is a fast PDF parsing library, built on top of pdfminer.six, for extracting and finding text within PDF documents

Python
186
4 months ago

Entity Disambiguation as text extraction (ACL 2022)

Python
181
3 years ago

AWS Lambda functions to extract text from various binary formats.

Python
177
7 years ago