hocr
Read and extract text and other content from PDFs in C# (port of PDFBox)
A Gtk/Qt front-end to tesseract-ocr.
OCR engine for all the languages
Document Layout Analysis resources for development with PdfPig.
Web interface for recognizing text, proofreading OCR, and creating fully-digitized documents.
Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
Text Overlay plugin for Mirador 3
Convert between Tesseract hOCR and ALTO XML using XSL stylesheets
Ergonomic line-by-line transcription of scanned text.
Probabilistic key-value pair extraction from invoices (non-searchable PDFs) using word weights
Some basic data and text extraction from the New York City Directories
CLI tool that recognises handwritten text from answer sheets using Tesseract OCR, then uses the extracted text to evaluate marks with NLP
The data for guides to breweries across the United States from 1896 to 1918
Python parser for hOCR files using lxml
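
As a rough illustration of what parsing hOCR involves, here is a minimal sketch that reads word text, bounding boxes, and confidences from an hOCR fragment. It is not the API of any project listed above. It uses Python's standard-library `xml.etree.ElementTree` rather than lxml to stay self-contained, and the names `SAMPLE_HOCR`, `parse_title`, and `extract_words` are hypothetical. The sample assumes the common convention of `ocrx_word` spans whose `title` attribute carries `bbox` and `x_wconf` properties. Real files that contain HTML entities or malformed markup may need a more lenient parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a tiny, well-formed hOCR fragment for illustration only.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div class="ocr_page" title="bbox 0 0 200 50">
<span class="ocr_line" title="bbox 10 10 190 40">
<span class="ocrx_word" title="bbox 10 10 90 40; x_wconf 96">Hello</span>
<span class="ocrx_word" title="bbox 100 10 190 40; x_wconf 91">world</span>
</span>
</div>
</body>
</html>"""


def parse_title(title):
    """Split an hOCR title attribute into a {property: [values]} dict."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


def extract_words(hocr_text):
    """Return a (text, bbox, confidence) tuple for every ocrx_word element."""
    root = ET.fromstring(hocr_text)
    words = []
    # iter() walks all elements; attributes stay unqualified even though
    # the tags themselves live in the XHTML namespace.
    for elem in root.iter():
        if "ocrx_word" not in elem.get("class", "").split():
            continue
        props = parse_title(elem.get("title", ""))
        bbox = tuple(int(v) for v in props.get("bbox", []))
        conf = float(props["x_wconf"][0]) if "x_wconf" in props else None
        text = "".join(elem.itertext()).strip()
        words.append((text, bbox, conf))
    return words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_HOCR):
        print(word)
```

Matching on the `class` attribute instead of the tag name sidesteps XHTML namespace handling, which keeps the sketch short.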