extract-text

Node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

HTML
1667
3 years ago

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

HTML
314
2 years ago

⚠ ARCHIVED ⚠ Search across and get full text for OA & closed journals

R
271
3 years ago

Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, plus an ingestor to a Solr or Elasticsearch index and a linked data graph database

Python
268
3 years ago

Use the Java Tika text extraction library on the .NET platform

Rich Text Format
207
1 year ago
Python
128
2 months ago

Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.

Go
100
1 year ago

Read PDF files in JavaScript

JavaScript
79
5 years ago

R wrapper for the antiword utility

C
58
16 days ago

Build search across multiple documents client-side in your file storage

JavaScript
45
2 years ago

An R package to extract text from PDF files.

C++
40
2 years ago

A collection of tools for OCR (optical character recognition).

C
30
6 months ago

pdfRest API Toolkit is a REST API service for processing PDF documents, made by developers, for developers. Rapidly integrate PDF workflows with your existing projects and applications, simply and seamlessly. Get started for free in seconds.

Java
26
2 months ago

Repo containing a small demo that extracts text from images (OCR) using the Google Vision API in Python

Jupyter Notebook
25
4 years ago

VNDB explorer and VNR-like text hooker.

C#
24
3 months ago

Node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

HTML
20
2 days ago