document-parser

RAGFlow is a leading open-source Retrieval-Augmented Generation (RAG) engine that fuses cutting-edge RAG with Agent capabilities to create a superior context layer for LLMs

TypeScript

65500

6872

1 day ago

docling-project / docling

Get your documents ready for gen AI

artificial-intelligence convert documents pdf tables document-parser document-parsing docx HTML Markdown pdf-converter pdf-to-json pdf-to-text pptx xlsx

Python

40576

2838

2 days ago

Unstructured-IO / unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

deep-learning document-parsing machine-learning natural-language-processing OCR information-retrieval data-pipelines preprocessing pdf-to-text pdf pdf-to-json document-image-analysis donut document-image-processing document-parser docx langchain large-language-model

HTML

12817

1049

8 days ago

freeok / so-novel

Novel downloader | Web fiction downloader | Online novels

content-export document-parser ebook offline-reader command-line-interface tui novel

Java

4442

384

3 days ago

Marker-Inc-Korea / AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

analysis automl benchmarking document-parser embeddings evaluation large-language-model llm-evaluation llm-ops Open Source ops optimization pipeline Python qa rag rag-evaluation retrieval-augmented-generation

Python

4332

348

7 days ago

run-llama / llama_cloud_services

Knowledge Agents and Management in the Cloud

document Parsing pdf pdf-document-processor pptx structured-data document-parser document-parsing docx-to-markdown pdf-to-excel pdf-to-json pdf-to-text ppt-to-json tables ppt-to-markdown pdf-to-markdown

TypeScript

4162

455

3 hours ago

Filimoa / open-parse

Improved file parsing for LLMs

document-structure table-detection document-parser layout-parsing

Python

3105

135

1 year ago

deepdoctection / deepdoctection

A Repo For Document AI

document-parser document-image-analysis table-recognition OCR document-ai document-understanding Python document-layout-analysis table-detection PyTorch Tensorflow layoutlm natural-language-processing

Python

2964

170

19 days ago

liweiphys / layra

LAYRA—an enterprise-ready, out-of-the-box solution—unlocks next-generation intelligent systems powered by visual RAG and limitless visual multi-step agent workflow orchestration.

agent document-parser knowledge-base gpt-4o large-language-model FastAPI workflow

TypeScript

819

6 days ago

NanoNets / docstrange

Extract and convert data from any document, image, PDF, Word document, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

large-language-model Markdown OCR pdf-to-markdown structured-data artificial-intelligence document-parser document-parsing pdf-parser pdf-to-json tables

Python

696

24 days ago

opendataloader-project / opendataloader-pdf

Safe, Open, High-Performance — PDF for AI

JSON Markdown pdf artificial-intelligence document-parser document-parsing documents HTML ocr-recognition pdf-converter pdf-to-json pdf-to-markdown recognition tables dataloader SDK

Java

681

1 day ago

iamarunbrahma / vision-parse

Parse PDFs into markdown using Vision LLMs

document-parser pdf-parser pdf-to-markdown text-extraction

Python

429

3 hours ago

GiftMungmeeprued / document-parsers-list

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

data-pipeline document-image-processing document-parser document-parsing langchain OCR pdf pdf-to-text preprocessing

156

3 months ago

marieai / marie-ai

Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pipelines (GenAI, LLM, VLLM) into your applications, supporting various tasks such as document cleanup, optical character recognition (OCR), classification, splitting, named entity recognition, and form processing.

OCR optical-character-recognition Docker document-layout-analysis document-parser Python PyTorch table-detection table-recognition

Python

1 day ago

papercast-dev / papercast

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, or PDF with GROBID and LangChain, and listen to them as a podcast. Customize your own pipelines.

arxiv Python dag natural-language-processing pdf-converter pdf-document-processor pipeline document-parser document-parsing pdf-to-text podcast tts

Python

7 months ago

JPLeoRX / opencv-text-deskew

Tutorial on how to deskew (straighten) text images

Python OpenCV computer-vision image-processing document-parser opencv-python tutorial

Python

4 years ago

LianjiaTech / bella-domify

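Document Parser that supports PDF, TXT, DOC, DOCX, Markdown, and other file formats. It efficiently extracts and parses content and generates a standard document tree structure. Built-in PDF Parser, Text Parser, and Word Parser support RAG, knowledge bases, full-text search, and other intelligent applications.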

document-parser pdf-parser Parser

Python

18 days ago

InvoiceableAI / Invoiceable

The invoice, document, and resume parser powered by AI.

artificial-intelligence document-parser documents experimental invoice invoices Python resume resume-parser resumes

Python

10 months ago

graphlit / graphlit

Graphlit Platform

chatbot copilot data framework large-language-model rag vector-database document-parser information-retrieval natural-language-processing pdf-to-json pdf-to-text

2 years ago

decisionfacts / semantic-ai

An open-source framework for Retrieval-Augmented System (RAG) that uses semantic search to retrieve the expected results and generates human-readable conversational responses with the help of an LLM (Large Language Model).

approximate-nearest-neighbor-search deep-neural-networks document-parser docx FastAPI inference-api llama2 large-language-model machine-learning OCR openai pdf rag retrieval-augmented-generation semantic-search vector-database openai-api

Python

1 year ago