big-data
The Patterns of Scalable, Reliable, and Performant Large-Scale Systems
Apache Spark - A unified analytics engine for large-scale data processing
ClickHouse® is a real-time analytics database management system
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
An open source cybersecurity protocol for syncing decentralized graph data.
The official home of the Presto distributed SQL query engine for big data
The Data Engineering Cookbook
CMAK is a tool for managing Apache Kafka clusters
A distributed, fast open-source graph database featuring horizontal scalability and high availability
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Open-Source Web UI for Apache Kafka Management
The most widely used Python to C compiler
Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
A fast, scalable, high-performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression, and other machine learning tasks in Python, R, Java, and C++. Supports computation on CPU and GPU.
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs