# data-lake

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

Python
4024
2 hours ago

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

Scala
2232
5 days ago

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

Java
1668
2 years ago

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

Java
1114
3 years ago

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.

Java
1039
2 days ago

Lakekeeper is an Apache-licensed, secure, fast, and easy-to-use Apache Iceberg REST Catalog written in Rust.

Rust
858
8 hours ago

An efficient storage and compute engine for both on-prem and cloud-native data analytics.

Java
587
2 days ago

Generic Data Ingestion & Dispersal Library for Hadoop

Java
480
2 years ago

Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required

Jupyter Notebook
416
5 years ago

GigAPI is a Timeseries lakehouse for real-time data and sub-second queries, powered by DuckDB OLAP + Parquet Query Engine, Compactor w/ Cloud-Native Storage. Drop-in FDAP alternative ⭐

Go
306
8 hours ago

BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)

C++
251
4 months ago

Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

Python
242
1 month ago