extract-text
Node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
⚠ ARCHIVED ⚠ Search across and get full text for OA & closed journals
Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction and named entity recognition), and data enrichment (annotation) pipelines, with an ingestor to a Solr or Elasticsearch index and a linked data graph database
Use the Java Tika text extraction library on the .NET platform
Text extraction from multiple and large PDF documents.
Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.
C# and VB.NET samples for Docotic.Pdf library
R Interface to Apache Tika
Build search across multiple documents client-side in your file storage
Simple rule-based named entity recognition
A collection of tools for OCR (optical character recognition).
pdfRest API Toolkit is a REST API service for processing PDF documents, made by developers, for developers. Rapidly integrate PDF workflows with your existing projects and applications, simply and seamlessly. Get started for free in seconds.
A small demo that extracts text from images (OCR) using the Google Vision API in Python
VNDB explorer and VNR-like text hooker.
Text Processing & Segmentation Framework