document-parsing

Awesome multilingual OCR and document parsing toolkits based on PaddlePaddle (a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment across server, mobile, embedded, and IoT devices)

Python
52773
19 hours ago

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

HTML
12407
6 days ago
enoch3712/ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

Python
1373
17 days ago

Open-source platform for processing unstructured data (PDFs, images, audio files), built for knowledge workers

TypeScript
356
5 months ago

Extract and convert data from any document, image, PDF, Word doc, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

Python
332
1 day ago

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

139
1 month ago

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, or PDF with GROBID and LangChain, or listen to them as a podcast. Customize your own pipelines.

Python
52
5 months ago

A Unified Toolkit for Deep Learning-Based Table Extraction

Python
45
9 months ago

Jupyter notebooks testing different OCR models for document parsing (Dolphin, MonkeyOCR, Marker, Nanonets, ...)

Jupyter Notebook
44
18 days ago

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

29
2 years ago

Docling4j brings the functionalities of Docling in document understanding to Java® projects

Java
14
5 months ago

Applicant Tracking System (ATS): A powerful platform leveraging generative AI and soft-match algorithms to analyze resumes against job descriptions. Built with React and Node.js, it streamlines hiring insights. Future plans include expanding to investor pitches and other structured documents.

JavaScript
6
4 months ago

Tool for converting First National Bank (FNB) bank statement PDFs into useful structured data

Python
4
10 months ago

Docparser OCR Package for PHP Laravel

PHP
3
8 days ago

Transform your documents into intelligent conversations. This open-source RAG chatbot combines semantic search with fine-tuned language models (LLaMA, Qwen2.5VL-3B) to deliver accurate, context-aware responses from your own knowledge base. Join our community!

Python
2
7 days ago

LeapRAG is an open-source platform that integrates advanced RAG technology with Google’s A2A protocol, enabling users to build context-aware, data-driven agents. These agents are automatically A2A-compliant and can be discovered and used by any compatible client without extra development.

Python
2
3 months ago