
pdf-to-text


Get your documents ready for gen AI

AI convert documents pdf tables document-parser document-parsing docx HTML Markdown pdf-converter pdf-to-json pdf-to-text pptx xlsx

Python

40576

2838

2 days ago

Unstructured-IO / unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

deep-learning document-parsing machine-learning natural-language-processing OCR information-retrieval data-pipelines preprocessing pdf-to-text pdf pdf-to-json document-image-analysis donut document-image-processing document-parser docx langchain large-language-models

HTML

12817

1049

8 days ago

run-llama / llama_cloud_services

Knowledge Agents and Management in the Cloud

document Parsing pdf pdf-document-processor pptx structured-data document-parser document-parsing docx-to-markdown pdf-to-excel pdf-to-json pdf-to-text ppt-to-json tables ppt-to-markdown pdf-to-markdown

TypeScript

4162

455

3 hours ago

enoch3712 / ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

AI large-language-models natural-language-processing OCR openai Python document-image-analysis document-intelligence document-parsing document-processing langchain machine-learning pdf pdf-to-text

Python

1427

138

1 month ago

Academic-Hammer / SciTSR

Table structure recognition dataset of the paper: Complicated Table Structure Recognition

table-structure-recognition pdf-to-text

Python

376

5 years ago

pd3f / pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

pdf text-extraction pdf-to-text pipeline machine-learning OCR language-model extract-text parsr Python

HTML

327

2 years ago

GiftMungmeeprued / document-parsers-list

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

data-pipeline document-image-processing document-parser document-parsing langchain OCR pdf pdf-to-text preprocessing

156

3 months ago

shoryasethia / markdrop

A Python package for converting PDFs to Markdown while extracting images and tables and generating text descriptions for the extracted tables and images using several LLM clients, among other functionality. Markdrop is available on PyPI.

Open Source pypi-package image-to-text large-language-models pdf-to-markdown pdf-to-text table-to-text agents

Python

154

3 months ago

NanoNets / ocr-python

OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.

OCR tesseract pdf Python pdf-to-json pdf-to-text image-to-text

Jupyter Notebook

116

3 years ago

nainiayoub / pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents

pdf-to-text Streamlit streamlit-webapp text-extraction Python OCR ocr-python pdf

Python

1 year ago

datalogics / adobe-pdf-library-samples

Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library

OCR pdf pdf-conversion pdf-converter pdf-document pdf-generation pdf-lib pdf-manipulation pdf-merger pdf-parser pdf-split pdf-to-text pdf-tools pdfa

2 years ago

BitMiracle / Docotic.Pdf.Samples

C# and VB.NET samples for Docotic.Pdf library

pdf-library pdf-to-text pdf-signature pdf-generation pdf-merge extract-text net-core pdf-manipulation pdf-parser html-to-pdf

Visual Basic .NET

2 months ago

galkahana / pdf-text-extraction

CLI for extracting text from PDF files (and possibly tables)

pdf pdf-to-text

C++

4 months ago

mbzuai-oryx / KITAB-Bench

[ACL 2025 🔥] A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding

arabic benchmark layout-detection OCR pdf-to-text table-detection vlms vqa

Python

4 months ago

papercast-dev / papercast

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, or PDF with GROBID and LangChain, and listen to them as a podcast. Customize your own pipelines.

arxiv Python dag natural-language-processing pdf-converter pdf-document-processor pipeline document-parser document-parsing pdf-to-text podcast tts

Python

7 months ago

iditectweb / converter

Standalone .NET converter library that requires neither Adobe Acrobat components nor Microsoft Office Interop assemblies to convert PDF, DOCX, XLSX, HTML, Image, CSV, RTF, and TXT in the .NET Framework

pdf-to-text html-to-pdf

7 years ago