Trending repositories for topic apache-spark
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
Code for "Efficient Data Processing in Spark" Course
This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of...
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
lakeFS - Data version control for your data lake | Git for data
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
PySpark Tutorial for Beginners - Practical Examples in Jupyter Notebook with Spark version 3.4.1. The tutorial covers various topics like Spark Introduction, Spark Installation, Spark RDD Transformati...
Fundamentals of Spark with Python (using PySpark), code examples
Spark fires is an anti-pattern playground where we deliberately break Spark applications in various ways so you can observe what happens and potentially recognise the issue when you come across it in y...
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and PySpark.
A curated list of awesome Apache Spark packages and resources.
Feathr – A scalable, unified data and AI engineering platform for enterprise
Interactive and Reactive Data Science using Scala and Spark.
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
This repository contains the code for a real-time election voting system. The system is built using Python, Kafka, Spark Streaming, Postgres and Streamlit. The system is built using Docker Compose to e...
⛳️ PASS: Amazon Web Services Certified (AWS Certified) Machine Learning Specialty (MLS-C01) by learning based on our Questions & Answers (Q&A) Practice Tests Exams.
Built a real-time streaming pipeline to extract stock data, using Apache NiFi, Debezium, Kafka, and Spark Streaming. Loaded the transformed data into Glue database and created real-time dashboards usi...
A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing UCX communication layer
Apache Tomcat exploit and pentesting guide for penetration testers
Real-Time Financial Market Data Processing and Prediction application
High-performance data loader for OSM planet dumps. Transforms an OpenStreetMap World/Region PBF dump into a PostGIS pgsnapshot (lossless) OSM schema representation partitioned by H3 regions and/or into ArrowIPC/...
This project shows how to capture changes from a Postgres database and stream them into Kafka
BigDL: Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
Stream CDC into an Amazon S3 data lake in Apache Iceberg table format with AWS Glue Streaming and DMS
A Spark Connector that reads data from / writes data to Arrow-Flight end-points with Arrow-Flight and Flight-SQL
A command-line interface for packaging, deploying, and running your EMR Serverless Spark jobs
Experiment tracking server focused on speed and scalability