corpus-tools
Python and command-line tool to gather text and metadata on the Web: crawling, scraping, extraction, and output as CSV, JSON, HTML, MD, TXT, or XML
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
A very simple news crawler with a funny name
Bitextor generates translation memories from multilingual websites
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Python library for handling audio datasets.
OpusFilter - Parallel corpus processing toolkit
An advanced, extensible web front-end for the Manatee-open corpus search engine
Utilities for Processing the Switchboard Dialogue Act Corpus
An open source reimplementation of Benny Brodda's BETA in Python
SpeCT - Speech Corpus Toolkit for Praat. Documentation: https://lennes.github.io/spect/
A set of workflows for corpus building through OCR, post-correction and normalisation
Python library for extracting quantitative, reproducible metrics of multi-level alignment between speakers in naturalistic language corpora.
A parser for annotated MuseScore 3 files.
Multi-Language Dataset Cleaner/Creator for Mozilla's DeepSpeech Framework
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
Reading the data from OPIEC - an Open Information Extraction corpus
Rezonator: Dynamics of human engagement
Utilities for Processing the Meeting Recorder Dialogue Act Corpus