text-extraction

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Python
4593
11 days ago
Go
2882
22 days ago

Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

Python
2275
4 days ago

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Python
1611
4 months ago

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition

955
2 years ago

Heuristic-based boilerplate removal tool

Python
790
6 months ago

This repository has moved! https://github.com/unidoc/unipdf

Go
709
6 years ago

A self‑hosted search engine for documents. Help us improve Datashare by answering a survey on structured content: https://forms.gle/PYgusFsoBaMyzUec9

Java
650
12 hours ago

Text Extraction, Rendering and Converting of PDF Documents

C++
539
6 months ago

A simple library and set of tools for parsing, modifying, and composing SRT files.

Python
519
1 year ago

Parse PDFs into markdown using Vision LLMs

Python
414
6 months ago

[UNMAINTAINED] Extract values from strings and fill your structs with NLP.

Go
389
8 years ago

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

HTML
326
2 years ago
Python
305
2 months ago

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

HTML
204
1 year ago

hotpdf is a fast PDF parsing library, built on top of pdfminer.six, for extracting and finding text within PDF documents

Python
196
8 months ago

Entity Disambiguation as text extraction (ACL 2022)

Python
182
3 years ago