etl-pipeline

risingwavelabs / risingwave

Real-time event streaming platform. Streaming CDC, stream processing, low-latency serving, and Iceberg management.

database stream-processing Rust PostgreSQL kafka materialized-view data-engineering apache-iceberg etl-pipeline

Rust

8409

687

3 hours ago

Zipstack / unstract

No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents

etl-pipeline llm-platform unstructured-data

Python

5826

552

1 day ago

apache / streampark

Make stream processing easier! Easy-to-use streaming application development framework and operation platform.

streaming streampark apache development-framework easy-to-use etl-pipeline operation-platform

Java

4202

1046

18 days ago

orchest / orchest

Build data pipelines, the easy way 🛠️

data-science machine-learning pipelines ide Jupyter Notebook cloud self-hosted jupyterlab notebooks Docker Python data-pipelines deployment Kubernetes airflow dag etl etl-pipeline

TypeScript

4141

263

2 years ago

apache / hamilton

Apache Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.

data-science Python dag data-engineering dataframe etl etl-framework etl-pipeline feature-engineering machine-learning pandas software-engineering data-analysis lineage llmops mlops orchestration Hacktoberfest rag

Jupyter Notebook

2274

161

3 days ago

AlexIoannides / pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

pyspark etl-job Python data-engineering Apache Spark data-science etl etl-pipeline

Python

2000

770

3 years ago

san089 / Udacity-Data-Engineering-Projects

A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.

data data-engineering data-engineering-pipeline etl-pipeline cassandra-database postgresql-database data-modeling data-warehouse data-lake airflow cluster Apache Cassandra infrastructure PostgreSQL Amazon Web Services aws-ec2 aws-sdk aws-s3 cloudformation

Python

1747

550

3 years ago

san089 / goodreads_etl_pipeline

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

etl-pipeline etl-framework Apache Spark apache-airflow airflow redshift emr-cluster livy s3 data-lake scheduler data-migration data-engineering data-engineering-pipeline Python etl-job

Python

1422

238

6 years ago

Open-Source-Legal / OpenContracts

Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!

agent agentic-ai etl etl-pipeline large-language-models unstructured-data vector-database prompt-engineering

TypeScript

933

1 day ago

stitchfix / hamilton

A scalable general purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton

Python pandas dag data-science data-engineering NumPy software-engineering etl-framework etl-pipeline etl feature-engineering dataframe data-platform machine-learning

Python

860

2 years ago

techascent / tech.ml.dataset

A Clojure high performance data processing system

Clojure dataframe CSV xlsx datascience machine-learning dataset etl-pipeline Java

Clojure

720

9 days ago

trustgraph-ai / trustgraph

The agentic AI platform for enterprise. Built by data engineers for data engineers. Complete context engineering and LLM orchestration infrastructure. Run anywhere - local, cloud, or bare metal.

graphrag context context-engineering model-serving agentic-ai agentic-ai-development agentic-rag ai-native data data-engineering data-extraction etl-pipeline

Python

646

4 days ago

SorellaLabs / brontes

A blazingly fast general purpose blockchain analytics engine specialized in systematic mev detection

Ethereum evm mev etl-pipeline Rust

Rust

637

2 months ago

Pravko-Solutions / FlashLearn

Integrate LLM in any pipeline - fit/predict pattern, JSON driven flows, and built-in concurrency support.

artificial-intelligence ai-agents concurrency large-language-models llm-agent Python agentic-ai-development ai-agents-framework etl-pipeline

Python

606

7 months ago

YotpoLtd / metorikku

A simplified, lightweight ETL Framework based on Apache Spark

big-data Apache Spark Scala etl-framework distributed-computing SQL etl etl-pipeline

Scala

588

157

2 years ago

DataWithBaraa / sql-data-warehouse-project

A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.

data-analysis data-analytics data-cleaning data-engineering data-science data-warehouse data-warehousing datalake datascience datawarehouse etl etl-job etl-pipeline SQL sql-query sql-server

TSQL

356

288

5 months ago

unbody-io / unbody

The Supabase of the AI era. A modular, open-source backend for building AI-native software, designed for knowledge, not static data.

agentic-ai ai-native backend chatbot data-ingestion developer-tools etl-pipeline generative-ai knowledge-base large-language-models rag vector-database

TypeScript

354

4 months ago

ebonnal / streamable

concurrent & fluent interface for (async) iterables

data-engineering etl-pipeline etl reverse-etl collections streams fluent-interface immutability lazy-evaluation method-chaining visitor-pattern data Python asyncio concurrent-data-structure multiprocessing multithreading

Python

281

5 days ago

airscholar / e2e-data-engineering

An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.

apache-airflow apache-kafka Apache Spark big-data Apache Cassandra containerization data-engineering data-pipeline data-processing Docker etl-pipeline PostgreSQL real-time-analytics

Python

280

127

8 months ago

jitsucom / bulker

Service for bulk-loading data to databases with automatic schema management (Redshift, Snowflake, BigQuery, ClickHouse, Postgres, MySQL)

data-engineering datawarehouse etl etl-pipeline ingestion pipeline

194

17 days ago