pdf-to-text

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

HTML
12407
6 days ago
enoch3712/ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Python
1373
17 days ago

Table structure recognition dataset of the paper: Complicated Table Structure Recognition

Python
374
5 years ago

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

HTML
326
2 years ago

A Python package that converts PDFs to Markdown, extracts images and tables, and generates text descriptions of the extracted tables and images using several LLM clients, among other features. Markdrop is available on PyPI.

Python
144
1 month ago

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

139
1 month ago

OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.

Jupyter Notebook
112
3 years ago

PDF text data extraction web app with OCR for scanned documents

Python
88
1 year ago

CLI for extracting text from PDF files (and possibly tables)

C++
74
2 months ago

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, or PDF files with GROBID and LangChain, or listen to them as a podcast. Customize your own pipelines.

Python
52
5 months ago

[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

Python
46
3 months ago

Standalone .NET converter library that requires neither Adobe Acrobat components nor Microsoft Office Interop assemblies, for converting PDF, DOCX, XLSX, HTML, image, CSV, RTF, and TXT files in the .NET Framework

C#
40
7 years ago

Simple PHP PDF to Text class

PHP
24
2 years ago