data-processing
Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
A collection of handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
Select, put and delete data from JSON, TOML, YAML, XML and CSV files with a single tool. Supports conversion between formats and can be used as a Go package.
A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
Data processing for and with foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
A lightweight data processing framework built on DuckDB and 3FS.
A lightweight, flexible, and expressive statistical data testing library
Data transformation framework for AI. Ultra performant, with incremental processing.
Concurrent and multi-stage data ingestion and data processing with Elixir
Large-scale pretraining for dialogue
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
Kubernetes-native platform to run massively parallel data/streaming jobs
Python Stream Processing
Extract, Transform, Load (ETL) for Python 3.5+
Concurrent Python made simple
Source code accompanying the book Data Science on the Google Cloud Platform by Valliappa Lakshmanan (O'Reilly, 2017)
Data and tools for generating and inspecting OLMo pre-training data.
Easy data preparation with the latest LLM-based operators and pipelines.
Scalable data preprocessing and curation toolkit for LLMs