data-ingestion

SeaTunnel is a multimodal, high-performance, distributed, massive data integration tool.

data-integration high-performance offline real-time apache batch cdc change-data-capture data-ingestion elt streaming embeddings large-language-models multimodal

Java

8807

2089

12 hours ago

bruin-data / ingestr

ingestr is a CLI tool to copy data between any databases with a single command seamlessly.

BigQuery copy-database data-ingestion data-integration data-pipeline duckdb ingestion-pipeline sql-server PostgreSQL snowflake

Python

3236

106

1 day ago

apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

big-data data-ingestion flink paimon real-time-analytics Apache Spark table-store streaming-datalake

Java

3011

1239

5 days ago

dashbitco / broadway

Concurrent and multi-stage data ingestion and data processing with Elixir

Elixir data-ingestion data-processing concurrent

Elixir

2577

169

3 days ago

pravega / pravega

Pravega - Streaming as a new software-defined storage primitive

streaming streaming-data distributed-storage real-time-data data-ingestion

Java

2003

408

7 months ago

bruin-data / bruin

Build data pipelines with SQL and Python, ingest data from different sources, add quality checks, and build end-to-end flows.

analytics BigQuery data-modeling data-pipelines Python snowflake SQL data-analysis data-transformation data-ingestion data-platform

1019

9 hours ago

CrunchyData / pg_parquet

Copy to/from Parquet in S3, Azure Blob Storage, Google Cloud Storage, http(s) stores, local files, or standard input/output streams from within PostgreSQL

columnar data-ingestion data-migration parquet PostgreSQL azure-storage google-cloud-storage HTTP s3

Rust

597

3 days ago

unbody-io / unbody

The Supabase of AI era. A modular, open-source backend for building AI-native software — designed for knowledge, not static data.

agentic-ai ai-native backend chatbot data-ingestion developer-tools etl-pipeline generative-ai knowledge-base large-language-models rag vector-database

TypeScript

355

4 months ago

orbitalapi / orbital

Orbital automates integration between data sources (APIs, databases, queues, and functions). BFFs, API composition, and ETL pipelines that adapt as your specs change.

API integration Kotlin microservices api-integration api-management REST API TypeScript data-engineering data-ingestion etl Java

TypeScript

334

3 months ago

cuebook / cuelake

Use SQL to build ELT pipelines on a data lakehouse.

apache-iceberg delta lakehouse datalake data-lake elt etl data-engineering data-integration data-ingestion Apache Spark spark-sql data-transfer pipelines data-pipeline zeppelin-notebook SQL

JavaScript

288

3 years ago

merantix-momentum / squirrel-core

A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way 🌰

Python machine-learning data-science computer-vision cv natural-language-processing artificial-intelligence PyTorch TensorFlow datasets distributed DataOps deep-learning data-ingestion cloud-computing collaboration internal

Python

281

5 months ago

apache / paimon-rust

Apache Paimon Rust: the Rust implementation of Apache Paimon.

big-data data-ingestion paimon real-time-analytics Rust streaming-datalake table-store

Rust

130

5 months ago

thedataengineeringbook / thedataengineeringbook

The Data Engineering Book - a data engineering book by Thai people, for Thai people

data-engineering data Hacktoberfest book data-engineer data-pipeline data-integration data-ingestion data-infrastructure

JavaScript

114

2 months ago

jgperrin / net.jgp.labs.spark

Apache Spark examples exclusively in Java

Apache Spark ingestion Java data-ingestion dataframe

Java

102

2 years ago

paloaltodatabases / sequor

Build complete API integrations with YAML and SQL. Rapid development without vendor lock-in and per-row costs.

api-integration data-integration etl ipaas SQL workflow-automation data-engineering data-ingestion reverse-etl

Python

4 months ago

XavientInformationSystems / Data-Ingestion-Platform

data-ingestion flink storm apex Apache Spark batch-processing

Java

6 years ago

merantix-momentum / squirrel-datasets-core

Squirrel dataset hub

Python data-science machine-learning natural-language-processing artificial-intelligence computer-vision deep-learning TensorFlow cv collaboration PyTorch distributed DataOps cloud-computing datasets data-ingestion

Python

2 years ago

aws-samples / amazon-kinesis-data-processor-aws-fargate

Sample code for the AWS Big Data Blog Post Building a scalable streaming data processor with Amazon Kinesis Data Streams on AWS Fargate

data-ingestion containers

Python

6 months ago

Dynatrace / OneAgent-SDK-for-Java

Enables custom tracing of Java applications in Dynatrace

SDK sdk-java Application Performance Management (APM) agent data-ingestion

Java

5 months ago

Dynatrace / openkit-java

OpenKit Java Reference Implementation

data-ingestion SDK

Java

1 year ago