document-analysis

A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.

Python
42157
18 hours ago

The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

Python
5578
7 days ago

This repository provides training and testing code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.

Jupyter Notebook
648
5 years ago

Dedoc is a library (service) for automated document parsing that brings documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

Python
582
13 days ago

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

Python
567
1 year ago

Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

Python
351
3 years ago

A package for parsing PDFs and analyzing their content using LLMs.

Python
272
1 year ago

Pandora is an analysis framework that determines whether a file is suspicious and conveniently shows the results.

Python
268
8 days ago

YOLO models trained on DocLayNet: power your document intelligence with layout analysis.

Python
132
16 days ago

Local adaptive image binarization

C++
126
2 years ago

A web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Using retrieval-augmented generation (RAG) with OpenAI's GPT-3, it supports interactive conversations with documents for retrieval and summarization.

Python
124
1 year ago