document-processing

A system for agentic LLM-powered data processing and ETL

data etl LLM Python data-pipelines elt workflow agents semantic-data document-processing unstructured-data unstructured-data-analysis document-analysis

Python

2948

307

11 days ago

enoch3712 / ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Artificial Intelligence LLM NLP OCR openai Python document-image-analysis document-intelligence document-parsing document-processing langchain Machine Learning pdf pdf-to-text

Python

1428

138

1 month ago

dhlab-epfl / dhSegment

Generic framework for historical document processing

Tensorflow segmentation historical-data Python document-processing

Python

379

113

4 years ago

eclaire-labs / eclaire

Local-first, open-source AI assistant for your data — unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.

Artificial Intelligence ai-assistant Automation bookmark-manager bookmarks data-extraction document-processing LLM local-first note-taking OCR on-device-ai Open Source personal-knowledge-management Privacy REST API Self-hosted task-management web-archiving

TypeScript

302

3 days ago

ucbepic / TWIX

TWIX is an open-source data extraction tool that reconstructs structured data from documents at scale, accurately and at low cost, by inferring the shared underlying visual template across documents

document-data-extraction document-processing

Python

206

4 months ago

awslabs / project-lakechain

⚡ Cloud-native, AI-powered, document processing pipelines on AWS.

Amazon Web Services Computer Vision document-processing generative-ai Machine Learning NLP retrieval-augmented-generation Serverless Hacktoberfest aws-cdk

TypeScript

186

7 months ago

formkiq / formkiq-core

A full-featured Document Management Platform / Document Layer for your application, providing storage, discovery, processing, and retrieval. Deploys directly into your Amazon Web Services Cloud. Please 🌟 star to support our work!

amazon-web-services Amazon Web Services cloud-storage dms document-database document-management document-management-system document-processing headless Serverless OCR optical-character-recognition

Java

142

6 hours ago

Tele-AI / doc-ops-mcp

MCP server for seamless document format conversion and processing

document-conversion document-processing docx-to-pdf file-converter markdown-converter pdf-conversion watermark pdf-processing

TypeScript

129

6 days ago

iamarunbrahma / pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

document-conversion document-processing information-retrieval pdf-parsing pdf-to-markdown Python rag retrieval-augmented-generation text-extraction pdf-converter

Python

10 months ago

awslabs / rhubarb

A Python framework for multi-modal document understanding with Amazon Bedrock

amazon-bedrock document-processing generative-ai multi-modal

Python

1 month ago

parsee-ai / parsee-core

Retrieval of fully structured data made easy. Use LLMs or custom models. Specializes in PDFs and HTML files. Extensive support for tabular data extraction and multimodal queries.

document-processing LLM structured-data multimodal

Python

1 month ago

steindani / pandoc-include

An include filter for Pandoc

pandoc pandoc-filter Markdown document-processing

Haskell

5 years ago

PSPDFKit / nutrient-document-engine-mcp-server

A Model Context Protocol (MCP) server implementation that exposes document processing capabilities through natural language, supporting both direct human interaction and AI agent tool calling.

agentic-ai document-processing mcp-server

TypeScript

2 months ago

jmanhype / DSPy-Multi-Document-Agents

An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.

Artificial Intelligence distributed-systems document-processing knowledge-management NLP query-optimization vector-search

Python

1 year ago

aws-solutions / enhanced-document-understanding-on-aws

Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.

document-analysis document-processing

JavaScript

4 days ago

cburschka / lyx

Unofficial mirror of git://git.lyx.org/lyx.git (updates daily. not affiliated with lyx.org.)

mirror document-processing LaTeX

C++

3 years ago

abdullahshafiq-20 / ResumeTex

ResumeTex is an AI-powered tool that converts standard PDF resumes into professionally formatted LaTeX documents. This service helps you create elegant, structured resumes without needing to learn LaTeX syntax.

Automation developer-tools document-processing Express LaTeX Node.js Open Source pdf-parsing React resume Tailwind CSS TeX

JavaScript

1 month ago