document-analysis

A high-quality, one-stop open-source data extraction tool that converts PDF to Markdown and JSON.

Python
31399
2 days ago
JavaScript
1294
2 months ago

This repository provides training and test code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.

Jupyter Notebook
645
5 years ago

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

Python
563
9 months ago

Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

Python
346
2 years ago

A package for parsing PDFs and analyzing their content using LLMs.

Python
269
8 months ago

Pandora is an analysis framework that determines whether a file is suspicious and presents the results in a convenient form.

Python
260
4 days ago

Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

Python
233
1 day ago

Local adaptive image binarization

C++
126
2 years ago

Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.

Python
120
9 months ago

An on-premises, OCR-free unstructured data extraction tool powered by vision language models.

Python
103
3 days ago

YOLO models trained on DocLayNet: power your document intelligence with layout analysis

Python
100
1 month ago

Post-process Amazon Textract results with Hugging Face transformer models for document understanding

Python
96
4 months ago

(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper

Python
88
2 years ago