Trending repositories for topic apache-spark
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
lakeFS - Data version control for your data lake | Git for data
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and PySpark.
Feathr – A scalable, unified data and AI engineering platform for enterprise
This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spark jobs. It focuses on easing the collection and examination of...
GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
In this project, we set up end-to-end data engineering using Apache Spark, Azure Databricks, and Data Build Tool (DBT), with Azure as our cloud provider.
⛳️ PASS: Amazon Web Services Certified (AWS Certified) Machine Learning Specialty (MLS-C01) by learning based on our Questions & Answers (Q&A) Practice Tests Exams.
PySpark Tutorial for Beginners - Practical Examples in Jupyter Notebook with Spark version 3.4.1. The tutorial covers various topics like Spark Introduction, Spark Installation, Spark RDD Transformati...
Includes notes on using Apache Spark, with a drill-down on Spark for Physics, how to run TPCDS on PySpark, and how to create histograms with Spark. Also tools for stress testing and measuring CPUs' perfor...
A curated list of awesome Apache Spark packages and resources.
ELT Data Pipeline implementation in Data Warehousing environment
Code for "Efficient Data Processing in Spark" Course
.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
Interactive and Reactive Data Science using Scala and Spark.
This repository contains the code for a real-time election voting system. The system is built using Python, Kafka, Spark Streaming, Postgres and Streamlit. The system is built using Docker Compose to e...
Solution Accelerators for Serverless Spark on GCP, the industry's first auto-scaling and serverless Spark as a service
Tutorials on Big Data essentials: Hadoop, MapReduce, Spark. Explore a variety of tutorials and demonstrations on Big Data technologies, primarily in the form of Jupyter notebooks. Most notebooks are s...
The Internals of Spark Structured Streaming
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Sample code to collect Apache Iceberg metrics for table monitoring
RealTime StockStream is a streamlined simulation system for processing live stock market data. It uses Apache Kafka for data input, Apache Spark for data handling, and Apache Cassandra for data stora...
This project serves as a comprehensive guide to building an end-to-end data engineering pipeline using TCP/IP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch. It covers each stage from data ...
This project shows how to capture changes from a Postgres database and stream them into Kafka.
Built a real-time streaming pipeline to extract stock data, using Apache NiFi, Debezium, Kafka, and Spark Streaming. Loaded the transformed data into Glue database and created real-time dashboards usi...
Stream CDC into an Amazon S3 data lake in Apache Iceberg table format with AWS Glue Streaming and DMS
Apache Tomcat exploit and pentesting guide for penetration testers
A high-performance, scalable and efficient ShuffleManager plugin for Apache Spark, utilizing UCX communication layer
A VS Code Extension to make it easier to manage and develop Spark jobs on EMR