text-extraction

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Python
4763
22 days ago
Go
2919
1 month ago

Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

HTML
2417
1 day ago

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Python
1625
6 months ago

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition

955
2 years ago

Heuristic-based boilerplate removal tool

Python
796
7 months ago

This repository has moved! https://github.com/unidoc/unipdf

Go
709
6 years ago

Text Extraction, Rendering and Converting of PDF Documents

C++
536
1 month ago

A simple library and set of tools for parsing, modifying, and composing SRT files.

Python
523
2 years ago

Parse PDFs into markdown using Vision LLMs

Python
429
4 hours ago

[UNMAINTAINED] Extract values from strings and fill your structs with NLP.

Go
389
8 years ago

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

HTML
327
2 years ago
Python
312
3 months ago

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

HTML
205
1 year ago

hotpdf is a fast PDF parsing library, built on top of pdfminer.six, to extract text and find text within PDF documents

Python
196
10 months ago

Entity Disambiguation as text extraction (ACL 2022)

Python
182
3 years ago