Repository navigation

#

document-parser

Convert documents to structured data effortlessly. Unstructured is open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise grade Platform product for production grade workflows, partitioning, enrichments, chunking and embedding.

HTML
12407
6 天前
Marker-Inc-Korea/AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

Python
4191
2 个月前
Java
3846
2 天前
Filimoa/open-parse
Python
3042
9 个月前

LAYRA—an enterprise-ready, out-of-the-box solution—unlocks next-generation intelligent systems powered by visual RAG and limitless visual multi-step agent workflow orchestration.

TypeScript
797
7 小时前

Parse PDFs into markdown using Vision LLMs

Python
414
6 个月前

Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

Python
332
1 天前

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

139
1 个月前

Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pipelines (GenAI, LLM, VLLM) into your applications, supporting various tasks such as document cleanup, optical character recognition (OCR), classification, splitting, named entity recognition, and form processing

Python
70
6 小时前

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, PDF, with GROBID, LangChain, listen as podcast. Customize your own pipelines.

Python
52
5 个月前

Tutorial on how to deskew (straighten) text images

Python
52
3 年前

An open source framework for Retrieval-Augmented System (RAG) uses semantic search helps to retrieve the expected results and generate human readable conversational response with the help of LLM (Large Language Model).

Python
21
1 年前

An OCR based document parser to extract information from identity document images

TypeScript
21
3 年前

Novalad offers a unified, centralized platform enabling organizations to extract meaningful data and perform advanced processing at high speed.

Jupyter Notebook
17
1 个月前