Trending repositories for topic pyspark
Lightweight and extensible compatibility layer between dataframe libraries!
LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive (AI) workloads.
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
A curated list of awesome Apache Spark packages and resources.
This real-time project integrates flight information from the AviationStack API for DFW Airport and weather data from the National Weather Service API, to provide the latest arrival, departure, and fo...
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and pySpark.
An open source, standard data file format for graph data storage and retrieval.
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. ⚡
🐍 Quick reference guide to common patterns & functions in PySpark.
Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We have set out to bring state-of-the-art data engineering tools you love, such as ...
Implementing best practices for PySpark ETL jobs and applications.
PySpark-Tutorial provides basic algorithms using PySpark
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
Tutorials on Big Data essentials: Hadoop, MapReduce, Spark. Explore a variety of tutorials and demonstrations on Big Data technologies, primarily in the form of Jupyter notebooks. Most notebooks are s...
Hopsworks - Data-Intensive AI platform with a Feature Store
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
Code for "Efficient Data Processing in Spark" Course
Pandas, Polars, Spark, and Snowpark DataFrame comparison for humans and more!
A comprehensive Spark guide collated from multiple sources that can be referred to learn more about Spark or as an interview refresher.
Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake
A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models
Detailed notes and homework from the 2025 Data Engineering Zoomcamp by DataTalks.Club
This project demonstrates how to use Apache Airflow to submit jobs to an Apache Spark cluster in different programming languages, using Python, Scala and Java as examples.
A Docker Compose template that builds an interactive development environment for PySpark with Jupyter Lab, MinIO as object storage, Hive Metastore, Trino and Kafka
A simple VS Code devcontainer setup for local PySpark development
The goal of this project is to build a docker cluster that gives access to Hadoop, HDFS, Hive, PySpark, Sqoop, Airflow, Kafka, Flume, Postgres, Cassandra, Hue, Zeppelin, Kadmin, Kafka Control Center ...
This project was a joint effort by Lucas De Oliveira, Chandrish Ambati, and Anish Mukherjee to create a song and playlist embeddings for recommendations in a distributed fashion using a 1M playlist da...
Data Engineering examples for Airflow, Prefect; dbt for BigQuery, Redshift, ClickHouse, Postgres, DuckDB; PySpark for Batch processing; Kafka for Stream processing
Code for blog at: https://www.startdataengineering.com/post/docker-for-de/
A flake8 plugin that detects usage of withColumn in a loop or inside reduce
SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POC...
PySpark Tutorial for Beginners - Practical Examples in Jupyter Notebook with Spark version 3.4.1. The tutorial covers various topics like Spark Introduction, Spark Installation, Spark RDD Transformati...
This project introduces PySpark, a powerful open-source framework for distributed data processing. We explore its architecture, components, and applications for real-time data analysis.
Possibly the fastest DataFrame-agnostic quality check library in town.
Sample project to demonstrate data engineering best practices
Code/Notes for the Data Engineering Zoomcamp by DataTalksClub