data-lake

lakeFS - Data version control for your data lake | Git for data

data-engineering data-versioning Go object-storage data-lake aws-s3 data-quality azure-blob-storage google-cloud-storage git-for-data Apache Spark hadoop-filesystem datalake data-version-control azure-storage

4904

400

3 days ago

dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

data Python data-engineering data-lake data-loading data-warehouse elt extract load transform

Python

4245

338

1 day ago

apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

Apache Spark hive SQL thrift jdbc spark-sql data-lake hadoop Kubernetes Hacktoberfest

Scala

2251

962

3 days ago

san089 / Udacity-Data-Engineering-Projects

A few projects related to data engineering, including data modeling, infrastructure setup on the cloud, data warehousing, and data lake development.

data data-engineering data-engineering-pipeline etl-pipeline cassandra-database postgresql-database data-modeling data-warehouse data-lake airflow cluster Apache Cassandra infrastructure PostgreSQL Amazon Web Services aws-ec2 aws-sdk aws-s3 cloudformation

Python

1747

550

3 years ago

bytedance / bitsail

BitSail is a distributed, high-performance data integration engine that supports batch, streaming, and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

flink big-data data-integration data-lake data-pipeline data-synchronization high-performance real-time

Java

1673

334

2 years ago

san089 / goodreads_etl_pipeline

An end-to-end GoodReads data pipeline for building a data lake, data warehouse, and analytics platform.

etl-pipeline etl-framework Apache Spark apache-airflow airflow redshift emr-cluster livy s3 data-lake scheduler data-migration data-engineering data-engineering-pipeline Python etl-job

Python

1422

238

6 years ago

Teradata / kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

Apache Spark nifi data-lake teradata hadoop

Java

1114

570

3 years ago

apache / amoro

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.

big-data data-lake flink hudi iceberg management-system paimon Apache Spark trino

Java

1053

355

3 days ago

alanchn31 / Data-Engineering-Projects

Personal Data Engineering Projects

data-lake data-engineering data-warehouse Apache Cassandra MongoDB scrapy Apache Spark airflow PostgreSQL star-schema data-modeling

Jupyter Notebook

950

206

3 years ago

lakekeeper / lakekeeper

Lakekeeper is an Apache-licensed, secure, fast, and easy-to-use Apache Iceberg REST Catalog written in Rust.

catalog data-lake iceberg lakehouse Rust

Rust

922

6 hours ago

Canner / vulcan-sql

Data API Framework for AI Agents and Data Apps

api-builder data-lake data-warehouse Database SQL analytics reporting Spreadsheet BigQuery duckdb PostgreSQL snowflake restful-api TypeScript clickhouse ksqldb Artificial Intelligence ai-agent

TypeScript

694

1 year ago

pixelsdb / pixels

An efficient storage and compute engine for both on-prem and cloud-native data analytics.

cloud-database data-lake data-warehouse Database olap column-store

Java

641

131

10 hours ago

uber / marmaray

Generic Data Ingestion & Dispersal Library for Hadoop

hadoop data-lake avro-schema Apache Spark

Java

479

112

3 years ago

aws-solutions-library-samples / data-lakes-on-aws

Enterprise-grade, production-hardened, serverless data lake on AWS

Serverless Framework data-lake analytics Amazon Web Services etl data-engineering lake-formation Infrastructure as code best-practices

Python

472

149

4 days ago

Canner / wren-engine

🤖 The Semantic Engine for Model Context Protocol (MCP) Clients and AI Agents 🔥

business-intelligence data Data Analysis data-analytics data-lake data-warehouse SQL semantic semantic-layer Large Language Model Hacktoberfest agent agentic-ai Artificial Intelligence mcp mcp-server

Java

451

126

2 days ago

kaiwaehner / hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference

Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required

kafka hivemq MQTT kafka-streams kafka-connect ksql Tensorflow gRPC Java Python data-lake confluent ksqldb Terraform Google Cloud Kubernetes cloud MongoDB

Jupyter Notebook

417

148

5 years ago

gigapi / gigapi

GigAPI is a Timeseries lakehouse for real-time data and sub-second queries, powered by DuckDB OLAP + Parquet Query Engine, Compactor w/ Cloud-Native Storage. Drop-in FDAP alternative ⭐

API duckdb Go olap parquet s3 Database REST API SQL clickhouse-server datalake query-engine data-lake lakehouse

346

13 days ago

cuebook / cuelake

Use SQL to build ELT pipelines on a data lakehouse.

apache-iceberg delta lakehouse datalake data-lake elt etl data-engineering data-integration data-ingestion Apache Spark spark-sql data-transfer pipelines data-pipeline zeppelin-notebook SQL

JavaScript

288

3 years ago

maxi-k / btrblocks

BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)

compression data-lake Database research

C++

261

6 months ago

awslabs / amazon-s3-find-and-forget

Amazon S3 Find and Forget is a solution for handling data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR).

data-lake amazon-s3 s3 gdpr Amazon Web Services parquet ccpa big-data Privacy data

Python

242

3 months ago