
pdf-parser


A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.

extract-data layout-analysis OCR Parser pdf pdf-converter Python document-analysis pdf-parser pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag ai4science

Python

42157

3460

18 hours ago

py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

pypdf2 pdf Python pdf-parser pdf-parsing pdf-manipulation pdf-documents help-wanted

Python

9326

1488

4 hours ago

bytedance / Dolphin

The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

document-analysis layout-analysis OCR Parser pdf pdf-converter pdf-parser Python vlm-ocr

Python

5578

442

7 days ago

dromara / yft-design

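yft-design is a powerful, visually polished online design tool built with Vue3, fabric.js, and Element Plus. It is an open-source alternative to 稿定设计 (Gaoding Design) based on fabric.js, offering poster design and image editing. It suits scenarios such as poster generation, e-commerce product images, long-form article graphics, and cover editing for videos and WeChat official accounts.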

element-plus fabricjs canvas-editor clipper pdf-parser online-editor pdf-editor

TypeScript

1384

279

6 days ago

yobix-ai / extractous

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.

extraction pdf tika unstructured unstructured-data data-pipelines docx etl etl-pipelines large-language-models machine-learning natural-language-processing OCR pdf-parser rag Rust

Rust

1216

8 months ago

adithya-s-k / marker-api

Easily deployable 🚀 API to convert PDF to markdown quickly with high accuracy.

FastAPI pdf-converter pdf-files pdf-parser pdf-parsing API REST API

Python

886

105

10 months ago

drmingler / docling-api

Easily deployable and scalable backend server that efficiently converts various document formats (pdf, docx, pptx, html, images, etc.) into Markdown. With support for both CPU and GPU processing, it is ideal for large-scale workflows and offers text/table extraction, OCR, and batch processing with sync/async endpoints.

API FastAPI markdown-parser pdf-conversion pdf-converter pdf-parser pdf-parsing pdf-to-markdown

Python

673

6 months ago

ispras / dedoc

Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

doc docx odt documents excel pdf txt OCR scanned-documents table-recognition HTML html-parser pdf-parser document-analysis

Python

582

13 days ago

titipata / scipdf_parser

Python PDF parser for scientific publications: content and figures

pdf Parser pdf-parser

Python

422

1 year ago

iamarunbrahma / vision-parse

Parse PDFs into Markdown using Vision LLMs

document-parser pdf-parser pdf-to-markdown text-extraction

Python

414

6 months ago

NanoNets / docstrange

Extract and convert data from any document, image, PDF, Word document, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

large-language-models Markdown OCR pdf-to-markdown structured-data artificial-intelligence document-parser document-parsing pdf-parser pdf-to-json tables

Python

332

2 days ago

michelcrypt4d4mus / pdfalyzer

Analyze PDFs. With colors. And Yara.

malware-analysis pdf pdf-documents pdf-parser

YARA

284

17 days ago

lazyFrogLOL / llmdocparser

A package for parsing PDFs and analyzing their content using LLMs.

large-language-models natural-language-processing OCR rag chunking document-analysis pdf-parser

Python

272

1 year ago

sylphxltd / pdf-reader-mcp

An MCP server built with Node.js/TypeScript that allows AI agents to securely read PDF files (local or URL) and extract text, metadata, or page counts. Uses pdf-parse.

ai-agent mcp Node.js pdf pdf-parser stdio TypeScript

TypeScript

212

1 day ago