TOPIC

#big-data

Open source repositories tagged with #big-data, ranked by health score.

apache
apache/zeppelin
Java
93
health

Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.

6.6k
fluid-cloudnative
fluid-cloudnative/fluid
Go
91
health

Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)

1.9k
apache
apache/ozone
Java
91
health

Scalable, reliable, distributed storage system optimized for data analytics and object store workloads.

1.2k
arkime
arkime/arkime
C
89
health

Arkime is an open source, large scale, full packet capturing, indexing, and database system.

7.4k
apache
apache/datafusion-ballista
Rust
88
health

Apache DataFusion Ballista Distributed Query Engine

2.0k
StarRocks
StarRocks/starrocks
Java
88
health

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.

11.7k
vespa-engine
vespa-engine/vespa
Java
88
health

AI + Data, online. https://vespa.ai

6.9k
apache
apache/iotdb
Java
87
health

Apache IoTDB

6.3k
apache
apache/paimon
Java
87
health

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

3.3k
catboost
catboost/catboost
C++
87
health

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

9.0k
ClickHouse
ClickHouse/ClickHouse
C++
87
health

ClickHouse® is a real-time analytics database management system

47.6k
lakehq
lakehq/sail
Rust
86
health

Drop-in Apache Spark replacement written in Rust, unifying batch processing, stream processing, and compute-intensive AI workloads.

2.7k
trinodb
trinodb/trino
Java
86
health

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

12.8k
apache
apache/datafusion
Rust
85
health

Apache DataFusion SQL Query Engine

8.8k
ytsaurus
ytsaurus/ytsaurus
C++
84
health

YTsaurus is a scalable and fault-tolerant open-source big data platform.

2.2k
apache
apache/beam
Java
82
health

Apache Beam is a unified programming model for Batch and Streaming data processing.

8.6k