Trending repositories for topic spark

Last 3 days (new repositories)

no newly created repositories trending in the last 3 days

Last 3 days (absolute gain)

DataTalksClub/data-engineering-zoomcamp

Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.

29,732 (+35)

apache/incubator-graphar

An open source, standard data file format for graph data storage and retrieval.

259 (+17)

apache-2.0

apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

40,837 (+12)

apache-2.0

deeplearning4j/deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math c...

13,906 (+9)

apache-2.0

donnemartin/data-science-ipython-notebooks

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...

28,033 (+9)

DataEval/dingo

Dingo: A Comprehensive Data Quality Evaluation Tool

106 (+8)

apache-2.0

data-prep-kit/data-prep-kit

Open source project for data preparation of LLM application builders

592 (+8)

apache-2.0

tobymao/sqlglot

Python SQL Parser and Transpiler

7,393 (+8)

mit

pdsuwwz/chatgpt-vue3-light-mvp

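💭 A customizable single-turn Chat Bot web MVP prototype template, built with mainstream technologies including Vue 3, Vite 6, TypeScript, Naive UI, Pinia (v3), and UnoCSS. 🧤 Simple integration with large-model APIs; uses a single-turn AI Q&A mode in which each question is answered independently with no context required; supports typewriter-style streaming output; integrates markdown-it Mermaid/KaTex/L...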

274 (+7)

mit

tencentmusic/cube-studio

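cube studio is an open-source, cloud-native, one-stop machine learning / deep learning / large-model AI platform, with support for SSO login, big data platform integration, online notebook development, drag-and-drop pipeline orchestration, multi-node multi-GPU distributed training, hyperparameter search, VGPU inference serving, edge computing, a labeling platform, automated labeling, large-model fine-tuning, vllm large-model inference, llmops, private knowledge bases, an AI model app store, one-click model development/inference/fine-tuning, domestic (Chinese) cpu/gpu/npu chips, RDMA...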

4,107 (+7)

mage-ai/mage-ai

🧙 Build, run, and manage data pipelines for integrating and transforming data.

8,225 (+7)

apache-2.0

heibaiying/BigData-Notes

A beginner's guide to big data :star:

16,294 (+7)

aalansehaiyang/technology-talk

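[Big Tech Interview Column] A technical guide for Java programmers, covering interview questions, system architecture, career tips, mainstream middleware, and more, to help you become a better version of yourself!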

14,389 (+6)

getredash/redash

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

27,136 (+6)

bsd-2-clause

yeasy/docker_practice

Learn and understand Docker&Container technologies, with real DevOps practice!

25,273 (+5)

moj-analytical-services/splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

1,531 (+4)

mit

vector4wang/spring-boot-quick

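:herb: Quick-start learning examples based on Spring Boot, integrating open-source frameworks the author has encountered, such as rabbitmq (delayed queues), Kafka, jpa, redis, oauth2, swagger, jsp, docker, k3s, k3d, k8s, a mybatis encryption/decryption plugin, exception handling, log output, multi-module development, multi-environment packaging, caching, web crawlers, jwt, GraphQL, dubbo, zookeeper, Async, and more :pushpin...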

2,587 (+4)

lakesoul-io/LakeSoul

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

2,667 (+4)

apache-2.0

delta-io/delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

7,913 (+4)

apache-2.0

lakehq/sail

LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive (AI) workloads.

701 (+4)

apache-2.0

Last 3 days (relative gain)

DataEval/dingo

Dingo: A Comprehensive Data Quality Evaluation Tool

106 (+8%)

apache-2.0

apache/incubator-graphar

An open source, standard data file format for graph data storage and retrieval.

259 (+7%)

apache-2.0

pdsuwwz/chatgpt-vue3-light-mvp

274 (+3%)

mit

data-prep-kit/data-prep-kit

Open source project for data preparation of LLM application builders

592 (+1%)

apache-2.0

apache/doris-spark-connector

Spark Connector for Apache Doris

89 (+1%)

apache-2.0

databrickslabs/dqx

Databricks framework to validate Data Quality of pySpark DataFrames

241 (+0.8%)

lakehq/sail

LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive (AI) workloads.

701 (+0.6%)

apache-2.0

fancyChuan/bigdata-hub

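A knowledge system for data infrastructure and big data technology, covering mainstream frameworks such as hadoop, hive, spark, and flink and their related frameworks, as well as data middle platforms, data lakes, data governance, data warehouse construction, digital transformation, and more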

365 (+0.6%)

apache/uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.

412 (+0.5%)

apache-2.0

xl-xueling/xl-lighthouse

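A next-generation real-time computing foundation with computing performance 100x that of flink/spark. XL-LightHouse is a general-purpose streaming big data statistics system that supports very large data volumes and very high concurrency [a standalone edition is also supported]. Common use cases include PV and UV statistics; e-commerce sales and ordering-user statistics; log volume statistics; API call volume, error count, and latency statistics; and server operations monitoring. The system supports multi-dimensional statistics and a variety of complex filter conditions and logical checks, with one-click deployment and one-line integration, making it easy to achieve full-link business...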

301 (+0.3%)

apache-2.0

moj-analytical-services/splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

1,531 (+0.3%)

mit

NVIDIA/spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs

880 (+0.2%)

apache-2.0

cluster-apps-on-docker/spark-standalone-cluster-on-docker

Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. :zap:

486 (+0.2%)

mit

tencentmusic/cube-studio

4,107 (+0.2%)

vector4wang/spring-boot-quick

2,587 (+0.2%)

lakesoul-io/LakeSoul

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

2,667 (+0.2%)

apache-2.0

ankurchavda/streamify

A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

693 (+0.1%)

ohenley/awesome-ada

A curated list of awesome resources related to the Ada and SPARK programming languages

696 (+0.1%)

cc0-1.0

mongodb/mongo-spark

The MongoDB Spark Connector

719 (+0.1%)

apache-2.0

awslabs/data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS

729 (+0.1%)

apache-2.0

Last week (new repositories)

no newly created repositories trending in the last week

Last week (absolute gain)

DataTalksClub/data-engineering-zoomcamp

Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.

29,732 (+167)

apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

40,837 (+48)

apache-2.0

tobymao/sqlglot

Python SQL Parser and Transpiler

7,393 (+34)

mit

data-prep-kit/data-prep-kit

Open source project for data preparation of LLM application builders

592 (+25)

apache-2.0

deeplearning4j/deeplearning4j

13,906 (+25)

apache-2.0

tencentmusic/cube-studio

4,107 (+24)

apache/doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

13,389 (+23)

apache-2.0

apache/incubator-graphar

An open source, standard data file format for graph data storage and retrieval.

259 (+21)

apache-2.0

pdsuwwz/chatgpt-vue3-light-mvp

274 (+20)

mit

getredash/redash

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

27,136 (+20)

bsd-2-clause

donnemartin/data-science-ipython-notebooks

28,033 (+20)

mage-ai/mage-ai

🧙 Build, run, and manage data pipelines for integrating and transforming data.

8,225 (+19)

apache-2.0

delta-io/delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

7,913 (+18)

apache-2.0

zinggAI/zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

1,007 (+17)

agpl-3.0

XZB-1248/Spark

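✨Spark is a web-based, cross-platform and full-featured Remote Administration Tool (RAT) written in Go that allows you to control all your devices anywhere. Spark is a remote control and monitoring tool written in Go, with a web UI, cross-platform support, and many features; anytime and anywhere, you can...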

2,011 (+15)

bsd-2-clause

yeasy/docker_practice

Learn and understand Docker&Container technologies, with real DevOps practice!

25,273 (+15)

vector4wang/spring-boot-quick

2,587 (+14)

DataEval/dingo

Dingo: A Comprehensive Data Quality Evaluation Tool

106 (+13)

apache-2.0

lakesoul-io/LakeSoul

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

2,667 (+11)

apache-2.0

h2oai/h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Me...

7,092 (+11)

apache-2.0

Last week (relative gain)

DataEval/dingo

Dingo: A Comprehensive Data Quality Evaluation Tool

106 (+14%)

apache-2.0

apache/incubator-graphar

An open source, standard data file format for graph data storage and retrieval.

259 (+9%)

apache-2.0

pdsuwwz/chatgpt-vue3-light-mvp

274 (+8%)

mit

hoangsonww/Moodify-Emotion-Music-App

🎹 Moodify - an emotion-based music recommendation system that uses AI/ML models to analyze text, speech, and facial expressions, providing personalized music recommendations across web and mobile pla...

46 (+5%)

mit

data-prep-kit/data-prep-kit

Open source project for data preparation of LLM application builders

592 (+4%)

apache-2.0

databrickslabs/dqx

Databricks framework to validate Data Quality of pySpark DataFrames

241 (+2%)

isxcode/spark-yun

Big data computing platform based on Spark <至轻云 (Zhiqingyun): an ultra-lightweight big data computing platform / data middle platform>

173 (+2%)

apache-2.0

zinggAI/zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

1,007 (+2%)

agpl-3.0

lakehq/sail

LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive (AI) workloads.

701 (+2%)

apache-2.0

apache/doris-spark-connector

Spark Connector for Apache Doris

89 (+1%)

apache-2.0

josephmachado/data_engineering_best_practices

Sample project to demonstrate data engineering best practices

184 (+1%)

trannhatnguyen2/NYC_Taxi_Data_Pipeline

Nyc_Taxi_Data_Pipeline - DE Project

103 (+1.0%)

apache/spark-kubernetes-operator

Apache Spark Kubernetes Operator

106 (+1.0%)

apache-2.0

commoncrawl/cc-index-table

Index Common Crawl archives in tabular format

113 (+0.9%)

apache-2.0

ohenley/awesome-ada

A curated list of awesome resources related to the Ada and SPARK programming languages

696 (+0.9%)

cc0-1.0

fancyChuan/bigdata-hub

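A knowledge system for data infrastructure and big data technology, covering mainstream frameworks such as hadoop, hive, spark, and flink and their related frameworks, as well as data middle platforms, data lakes, data governance, data warehouse construction, digital transformation, and more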

365 (+0.8%)

XZB-1248/Spark

2,011 (+0.8%)

bsd-2-clause

apache/uniffle

Uniffle is a high performance, general purpose Remote Shuffle Service.

412 (+0.7%)

apache-2.0

OBenner/data-engineering-interview-questions

More than 2,000 data engineer interview questions.

1,296 (+0.7%)

xl-xueling/xl-lighthouse

301 (+0.7%)

apache-2.0

Last month (new repositories)

no newly created repositories trending in the last month

Last month (absolute gain)

DataTalksClub/data-engineering-zoomcamp

Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.

29,732 (+603)

apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

40,837 (+205)

apache-2.0

tobymao/sqlglot

Python SQL Parser and Transpiler

7,393 (+188)

mit

apache/doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

13,389 (+152)

apache-2.0

getredash/redash

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

27,136 (+141)

bsd-2-clause

tencentmusic/cube-studio

4,107 (+117)

donnemartin/data-science-ipython-notebooks

28,033 (+104)

heibaiying/BigData-Notes

A beginner's guide to big data :star:

16,294 (+101)

deeplearning4j/deeplearning4j

13,906 (+92)

apache-2.0

pdsuwwz/chatgpt-vue3-light-mvp

274 (+86)

mit

yeasy/docker_practice

Learn and understand Docker&Container technologies, with real DevOps practice!

25,273 (+86)

data-prep-kit/data-prep-kit

Open source project for data preparation of LLM application builders

592 (+82)

apache-2.0

delta-io/delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

7,913 (+78)

apache-2.0

wangzhiwubigdata/God-Of-BigData

Focused on big data learning and interviews; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...

10,022 (+68)

lakehq/sail

LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive (AI) workloads.

701 (+64)

apache-2.0

mage-ai/mage-ai

🧙 Build, run, and manage data pipelines for integrating and transforming data.

8,225 (+59)

apache-2.0

lakesoul-io/LakeSoul

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

2,667 (+58)

apache-2.0

DataEval/dingo

Dingo: A Comprehensive Data Quality Evaluation Tool

106 (+57)

apache-2.0

AlexIoannides/pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

1,860 (+56)

vector4wang/spring-boot-quick

2,587 (+51)

Last month (relative gain)

DataEval/dingo

Dingo: A Comprehensive Data Quality Evaluation Tool

106 (+116%)

apache-2.0

pdsuwwz/chatgpt-vue3-light-mvp

274 (+46%)

mit

tuanx18/data-engineer-portfolio

This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.

101 (+28%)

Ironclad-Project/Ironclad

Formally verified, real-time capable, UNIX-like operating system kernel written in SPARK and Ada.

27 (+23%)

gpl-3.0

hoangsonww/Moodify-Emotion-Music-App

46 (+21%)

mit

data-prep-kit/data-prep-kit

Open source project for data preparation of LLM application builders

592 (+16%)

apache-2.0

isxcode/spark-yun

Big data computing platform based on Spark <至轻云 (Zhiqingyun): an ultra-lightweight big data computing platform / data middle platform>

173 (+14%)

apache-2.0

ManuelGuerra1987/data-engineering-zoomcamp-notes

Detailed notes and homeworks from 2025 Data Engineering Zoomcamp by Datatalks.Club

42 (+14%)

thanhENC/e2e-data-platform

End-to-end data platform: A PoC Data Platform project utilizing modern data stack (Spark, Airflow, DBT, Trino, Lightdash, Hive metastore, Minio, Postgres)

34 (+13%)

mit

hexnn/Stark

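An easy-to-use, ultra-high-performance big data governance engine built on Spark + SparkMLlib + Debezium, suited to unified batch and streaming data integration and analysis. It supports machine learning models, CDC real-time data capture, data quality validation, data modeling, algorithm modeling, and OLAP data analysis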

27 (+13%)

databrickslabs/dqx

Databricks framework to validate Data Quality of pySpark DataFrames

241 (+12%)

apache/spark-kubernetes-operator

Apache Spark Kubernetes Operator

106 (+12%)

apache-2.0

apache/incubator-graphar

An open source, standard data file format for graph data storage and retrieval.

259 (+11%)

apache-2.0

kroudir/Data-Engineer-Nanodegree-Projects-Udacity

Projects done in the Data Engineer Nanodegree Program by Udacity.com

149 (+10%)

lakehq/sail

LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive (AI) workloads.

701 (+10%)

apache-2.0

opensearch-project/opensearch-spark

Spark Accelerator framework; it enables secondary indices to remote data stores.

34 (+10%)

apache-2.0

Mrkuhuo/bigdata_learning

Learning code for big data components

55 (+8%)

HamzaG737/data-engineering-project

End to end data engineering project with kafka, airflow, spark, postgres and docker.

86 (+8%)

mit

StabRise/spark-pdf

PDF DataSource for Apache Spark

45 (+7%)

agpl-3.0

iimeta/fastapi-sdk

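An enterprise-grade system for rapid LLM API integration, supporting OpenAI, Azure, ERNIE Bot (文心一言), iFlytek Spark (讯飞星火), Tongyi Qianwen (通义千问), Zhipu GLM (智谱GLM), Gemini, DeepSeek, Anthropic Claude, and OpenAI-format models, among others. It has a clean page style, is lightweight, efficient, and stable, and supports one-click Docker deployment.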

31 (+7%)

mit

Last 12 months (new repositories)

data-prep-kit/data-prep-kit

Open source project for data preparation of LLM application builders

592

apache-2.0

pdsuwwz/chatgpt-vue3-light-mvp

274

mit

databrickslabs/dqx

Databricks framework to validate Data Quality of pySpark DataFrames

241

davidzajac1/zillacode

Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake

160

apache-2.0

LLM-Red-Team/spark-free-api

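🚀 Reverse-engineered API for the iFlytek Spark (讯飞星火) large model [specialty: office assistant]. Supports high-speed streaming output, agent conversations, web search, AI image generation, long-document interpretation, image analysis, and multi-turn dialogue, with zero-configuration deployment, multi-token support, and automatic cleanup of session traces. For testing only; for commercial use, please go to the official open platform.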

135

gpl-3.0

DataEval/dingo

Dingo: A Comprehensive Data Quality Evaluation Tool

106

apache-2.0

trannhatnguyen2/NYC_Taxi_Data_Pipeline

Nyc_Taxi_Data_Pipeline - DE Project

103

hoangsonww/Moodify-Emotion-Music-App

mit

StabRise/spark-pdf

PDF DataSource for Apache Spark

agpl-3.0

ManuelGuerra1987/data-engineering-zoomcamp-notes

Detailed notes and homeworks from 2025 Data Engineering Zoomcamp by Datatalks.Club

thanhENC/e2e-data-platform

End-to-end data platform: A PoC Data Platform project utilizing modern data stack (Spark, Airflow, DBT, Trino, Lightdash, Hive metastore, Minio, Postgres)

mit

hexnn/Stark

Ironclad-Project/Ironclad

Formally verified, real-time capable, UNIX-like operating system kernel written in SPARK and Ada.

gpl-3.0

mrpowers-io/tsumugi-spark

SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.

apache-2.0

AsadiAhmad/Ngram-Spark-Wikipedia

Calculating Ngram with PySpark for wikipedia text

mit

AsadiAhmad/Dictionary-Spark

Calculating word counts for corresponding English and Persian text to build a dictionary with Spark

mit

Last 12 months (absolute gain)

DataTalksClub/data-engineering-zoomcamp

Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.

29,732 (+7,672)

apache/spark

Apache Spark - A unified analytics engine for large-scale data processing

40,837 (+2,689)

apache-2.0

FavioVazquez/ds-cheatsheets

List of Data Science Cheatsheets to rule the world

14,994 (+2,356)

mit

getredash/redash

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

27,136 (+2,340)

bsd-2-clause

apache/doris

Apache Doris is an easy-to-use, high performance and unified analytics database.

13,389 (+2,203)

apache-2.0

tobymao/sqlglot

Python SQL Parser and Transpiler

7,393 (+2,103)

mit

tencentmusic/cube-studio

4,107 (+1,841)

donnemartin/data-science-ipython-notebooks

28,033 (+1,692)

mage-ai/mage-ai

🧙 Build, run, and manage data pipelines for integrating and transforming data.

8,225 (+1,381)

apache-2.0

yeasy/docker_practice

Learn and understand Docker&Container technologies, with real DevOps practice!

25,273 (+1,219)

heibaiying/BigData-Notes

A beginner's guide to big data :star:

16,294 (+1,120)

delta-io/delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

7,913 (+1,103)

apache-2.0

apache/paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

2,705 (+942)

apache-2.0

wangzhiwubigdata/God-Of-BigData

Focused on big data learning and interviews; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...

10,022 (+828)

GaiZhenbiao/ChuanhuChatGPT

GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.

15,396 (+823)

gpl-3.0

lakehq/sail

LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive (AI) workloads.

701 (+700)

apache-2.0

apache/datafusion-comet

Apache DataFusion Comet Spark Accelerator

922 (+598)

apache-2.0

data-prep-kit/data-prep-kit

Open source project for data preparation of LLM application builders

592 (+591)

apache-2.0

kwai/blaze

A blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

1,432 (+576)

apache-2.0

horovod/horovod

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

14,437 (+536)

Last 12 months (relative gain)

isxcode/spark-yun

Big data computing platform based on Spark <至轻云 (Zhiqingyun): an ultra-lightweight big data computing platform / data middle platform>

173 (+1,473%)

apache-2.0

tuanx18/data-engineer-portfolio

This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.

101 (+1,163%)

HariSekhon/Knowledge-Base

Large Tech Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public

160 (+841%)

mit

sjrusso8/spark-connect-rs

Apache Spark Connect Client for Rust

106 (+563%)

apache-2.0

xl-xueling/xl-lighthouse

301 (+514%)

apache-2.0

airscholar/SparkingFlow

This project demonstrates how to use Apache Airflow to submit jobs to an Apache Spark cluster in different programming languages, using Python, Scala and Java as examples.

41 (+486%)

jomariya23156/sales-forecast-mlops-at-scale

Full-stack Highly Scalable Cloud-native Machine Learning system for demand forecasting with realtime data streaming, inference, retraining loop, and more

61 (+369%)

mit

HamzaG737/data-engineering-project

End to end data engineering project with kafka, airflow, spark, postgres and docker.

86 (+353%)

mit

opensearch-project/opensearch-spark

Spark Accelerator framework; it enables secondary indices to remote data stores.

34 (+278%)

apache-2.0