extract-text

Node.js module for extracting text from HTML, PDF, DOC, DOCX, XLS, XLSX, CSV, PPTX, PNG, JPG, GIF, RTF and more!

HTML
1684
3 years ago

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

HTML
326
2 years ago

⚠ ARCHIVED ⚠ Search across and get full text for OA & closed journals

R
270
3 years ago

Python-based open-source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, with an ingestor to a Solr or Elasticsearch index and a linked data graph database

Python
270
3 years ago

Use the Java Tika text extraction library on the .NET platform

Rich Text Format
206
1 year ago
Python
130
6 months ago

Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.

Go
102
2 years ago

Read PDF files in JavaScript

JavaScript
80
5 years ago

R wrapper for the antiword utility

C
57
4 months ago

Build search across multiple documents client-side in your file storage

JavaScript
45
2 years ago

An R package to extract text from PDF files.

C++
40
2 years ago

A collection of tools for OCR (optical character recognition).

C
30
10 months ago

pdfRest API Toolkit is a REST API service for processing PDF documents, made by developers, for developers. Rapidly integrate PDF workflows with your existing projects and applications, simply and seamlessly. Get started for free in seconds.

Java
28
13 days ago

A small demo that extracts text from images via OCR using the Google Vision API in Python

Jupyter Notebook
25
4 years ago

VNDB explorer and VNR-like text hooker.

C#
24
3 months ago

ZWSP-Tool is a toolkit for manipulating zero-width spaces quickly and easily. In particular, it can detect, clean, hide, extract and brute-force text containing zero-width spaces.

Python
23
5 years ago