data-lake

data load tool (dlt) is an open-source Python library that makes data loading easy 🛠️

Python
3485
2 hours ago

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

Scala
2178
3 days ago

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

Java
1655
1 year ago

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

Java
1111
2 years ago

Lakekeeper is an Apache-licensed, secure, fast, and easy-to-use Apache Iceberg REST Catalog written in Rust.

Rust
579
16 hours ago

Generic Data Ingestion & Dispersal Library for Hadoop

Java
478
2 years ago

Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required

Jupyter Notebook
414
4 years ago

Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

Python
243
1 month ago

BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)

C++
239
12 days ago

U-SQL Examples and Issue Tracking

C#
234
2 years ago

Resources for video demonstrations and blog posts related to DataOps on AWS

Python
175
3 years ago

An efficient storage and compute engine for both on-prem and cloud-native data analytics.

Java
143
5 days ago