pdf-to-text

enoch3712/ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Python
1193
3 days ago

Table structure recognition dataset of the paper: Complicated Table Structure Recognition

Python
361
5 years ago

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

HTML
314
2 years ago

A Python package for converting PDFs to Markdown while extracting images and tables, and for generating descriptive text for the extracted tables and images using several LLM clients, among other features. Markdrop is available on PyPI.

Python
92
23 days ago

PDF text data extraction web app with OCR for scanned documents

Python
87
10 months ago

OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.

Jupyter Notebook
80
2 years ago

CLI for extracting text from PDF files (and possibly tables)

C++
77
1 month ago

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, or PDF with GROBID and LangChain, or listen to them as a podcast. Customize your own pipelines.

Python
50
1 month ago

Standalone .NET converter library that requires neither Adobe Acrobat components nor Microsoft Office Interop assemblies, for converting PDF, DOCX, XLSX, HTML, image, CSV, RTF, and TXT files in the .NET Framework

C#
40
6 years ago

A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

Python
33
18 days ago

Simple PHP PDF to Text class

PHP
24
1 year ago

Simple PDF-to-text conversion with Python using PDFtk and PyPDF2

Python
20
2 years ago