#etl-pipeline

risingwavelabs/risingwave

Real-time event streaming platform. Streaming CDC, stream processing, low-latency serving, and Iceberg management.

Rust
8409
3 hours ago
Zipstack/unstract

No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents

Python
5826
1 day ago
apache/streampark

Make stream processing easier! Easy-to-use streaming application development framework and operation platform.

Java
4202
18 days ago

Apache Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.

Jupyter Notebook
2274
3 days ago

Implementing best practices for PySpark ETL jobs and applications.

Python
2000
3 years ago

Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!

TypeScript
933
1 day ago

A scalable general purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton

Python
860
2 years ago

A Clojure high performance data processing system

Clojure
720
9 days ago

The agentic AI platform for enterprise. Built by data engineers for data engineers. Complete context engineering and LLM orchestration infrastructure. Run anywhere - local, cloud, or bare metal.

Python
646
4 days ago

A blazingly fast general-purpose blockchain analytics engine specialized in systematic MEV detection

Rust
637
2 months ago

Integrate LLM in any pipeline - fit/predict pattern, JSON-driven flows, and built-in concurrency support.

Python
606
7 months ago

A simplified, lightweight ETL Framework based on Apache Spark

Scala
588
2 years ago

A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.

TSQL
356
5 months ago

The Supabase of AI era. A modular, open-source backend for building AI-native software — designed for knowledge, not static data.

TypeScript
354
4 months ago

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

Python
280
8 months ago

Service for bulk-loading data to databases with automatic schema management (Redshift, Snowflake, BigQuery, ClickHouse, Postgres, MySQL)

Go
194
17 days ago