Trending repositories for topic pyspark
🐍 Quick reference guide to common patterns & functions in PySpark.
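As a taste of what such a guide covers, here are a few of the everyday DataFrame patterns (an illustrative sketch, not taken from the repo itself):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("patterns-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", "US", 34), ("bob", "DE", 29), ("carol", "US", 41)],
    ["name", "country", "age"],
)

# Filter, derive a column, then aggregate: the bread-and-butter patterns.
result = (
    df.filter(F.col("age") > 30)
      .withColumn("name_upper", F.upper("name"))
      .groupBy("country")
      .agg(F.count("*").alias("n"), F.avg("age").alias("avg_age"))
)
result.show()
```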
Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!
Implementing best practices for PySpark ETL jobs and applications.
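For context, a minimal sketch of the structure such best practices usually converge on (pure transform functions, I/O kept at the edges); the paths and column names here are hypothetical:

```python
from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.functions as F

def extract(spark: SparkSession, path: str) -> DataFrame:
    # Keep reads at the edges so the transform stays unit-testable.
    return spark.read.parquet(path)

def transform(df: DataFrame) -> DataFrame:
    # Pure DataFrame-in, DataFrame-out logic; column names are hypothetical.
    return (df.filter(F.col("amount") > 0)
              .withColumn("amount_usd", F.col("amount") * F.col("fx_rate")))

def load(df: DataFrame, path: str) -> None:
    df.write.mode("overwrite").parquet(path)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("etl-job").getOrCreate()
    load(transform(extract(spark, "s3a://bucket/raw/")), "s3a://bucket/clean/")
```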
Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, a...
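A minimal sketch of Petastorm's Spark-to-PyTorch path, assuming the `petastorm.spark` converter API; consult the repo's docs for exact signatures:

```python
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.appName("petastorm-demo").getOrCreate()

# Petastorm materializes the DataFrame as Parquet in a cache dir,
# then streams batches into the ML framework.
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")

df = spark.range(1000).withColumnRenamed("id", "feature")
converter = make_spark_converter(df)

with converter.make_torch_dataloader(batch_size=64) as loader:
    for batch in loader:   # dict-like batches of tensors
        pass

converter.delete()  # drop the cached copy when done
```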
Code for "Efficient Data Processing in Spark" Course
Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake
A curated list of awesome Apache Spark packages and resources.
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
PySpark Tutorial for Beginners - Practical Examples in Jupyter Notebook with Spark version 3.4.1. The tutorial covers various topics like Spark Introduction, Spark Installation, Spark RDD Transformati...
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POC...
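A short sketch of what that looks like in practice (the column names and value ranges are made up for illustration):

```python
import dbldatagen as dg
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("synthetic-data").getOrCreate()

spec = (
    dg.DataGenerator(spark, name="users", rows=1_000_000, partitions=8)
      .withColumn("user_id", "long", uniqueValues=100_000)
      .withColumn("age", "int", minValue=18, maxValue=90)
      .withColumn("country", "string", values=["US", "DE", "IN", "BR"], random=True)
)

df = spec.build()  # an ordinary PySpark DataFrame, ready for tests or POCs
df.show(5)
```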
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
Hopsworks - Data-Intensive AI platform with a Feature Store
A simple VS Code devcontainer setup for local PySpark development
This project was a joint effort by Lucas De Oliveira, Chandrish Ambati, and Anish Mukherjee to create song and playlist embeddings for recommendations in a distributed fashion using a 1M playlist da...
Possibly the fastest DataFrame-agnostic quality check library in town.
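The blurb doesn't show what a "quality check" boils down to; in plain PySpark (a generic illustration, not this library's API), a completeness metric is simply:

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def completeness(df: DataFrame, column: str) -> float:
    # Fraction of non-null values in a column: a typical quality metric.
    total = df.count()
    non_null = df.filter(F.col(column).isNotNull()).count()
    return non_null / total if total else 1.0

# A check layer would then assert e.g. completeness(df, "user_id") >= 0.99
```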
An open source, standard data file format for graph data storage and retrieval.
A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and PySpark.
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML...
This project introduces PySpark, a powerful open-source framework for distributed data processing. We explore its architecture, components, and applications for real-time data analysis.
The goal of this project is to build a docker cluster that gives access to Hadoop, HDFS, Hive, PySpark, Sqoop, Airflow, Kafka, Flume, Postgres, Cassandra, Hue, Zeppelin, Kadmin, Kafka Control Center ...
A library that provides useful extensions to Apache Spark and PySpark.
Code repository for the "PySpark in Action" book
This repo collects the open-source work of the Analytics Service within NHS Digital Data Services
This repository will help you learn Databricks concepts with the help of examples. It will include all the important topics we need in our real-life experience as data engineers. We wil...
Data Engineering examples for Airflow, Prefect, and Mage.ai; dbt for BigQuery, Redshift, ClickHouse, PostgreSQL; Spark/PySpark for Batch processing; and Kafka for Stream processing
Code for blog at: https://www.startdataengineering.com/post/docker-for-de/
SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
PySpark-Tutorial provides basic algorithms using PySpark
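The canonical "basic algorithm" in this vein is word count; a self-contained RDD version:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["to be or not to be", "that is the question"])
counts = (
    lines.flatMap(lambda line: line.split())  # split lines into words
         .map(lambda word: (word, 1))         # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)     # sum counts per word
)
print(counts.collect())
```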
Code for "Efficient Data Processing in Spark" Course
PySpark Tutorial for Beginners - Practical Examples in Jupyter Notebook with Spark version 3.4.1. The tutorial covers various topics like Spark Introduction, Spark Installation, Spark RDD Transformati...
This project introduces PySpark, a powerful open-source framework for distributed data processing. We explore its architecture, components, and applications for real-time data analysis.
Possibly the fastest DataFrame-agnostic quality check library in town.
A simple VS Code devcontainer setup for local PySpark development
Data Engineering examples for Airflow, Prefect, and Mage.ai; dbt for BigQuery, Redshift, ClickHouse, PostgreSQL; Spark/PySpark for Batch processing; and Kafka for Stream processing
A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models
Sample project to demonstrate data engineering best practices
Code/Notes for the Data Engineering Zoomcamp by DataTalksClub
Create streaming data, send it to Kafka, transform it with PySpark, and deliver it to Elasticsearch and MinIO
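The PySpark leg of such a pipeline is typically Structured Streaming; a sketch under assumed broker and topic names:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Requires the spark-sql-kafka-0-10 package on the classpath.
spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

raw = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
         .option("subscribe", "events")                        # placeholder topic
         .load()
)

# Kafka delivers key/value as bytes; cast before transforming.
parsed = raw.select(F.col("value").cast("string").alias("payload"))

# Elasticsearch/MinIO sinks would replace the console sink below.
query = parsed.writeStream.format("console").start()
query.awaitTermination()
```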
Sparglim✨ makes PySpark apps configurable and Spark Connect Server easier to deploy!
A Docker Compose template that builds an interactive development environment for PySpark with Jupyter Lab, MinIO as object storage, Hive Metastore, Trino and Kafka
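In such an environment, pointing PySpark at MinIO usually comes down to a few S3A settings; the endpoint and credentials below are illustrative MinIO defaults, and the hadoop-aws jar must be on the classpath:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("minio-demo")
    # Hadoop S3A connector settings; values are illustrative MinIO defaults.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/data/")  # hypothetical bucket
```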
All the data and open-source code for the book *Python Machine Learning and Practice: The Road from Zero to Kaggle Competitions* (2022 Edition; original title 《Python机器学习及实践:从零开始通往Kaggle竞赛之路(2022年度版)》)