Trending repositories for topic data-engineering
🔥🔥🔥 Open Source Alternative to Hightouch, Census, and RudderStack - Reverse ETL & Data Activation
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
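As a rough illustration, a minimal DAG might look like the sketch below, assuming Airflow 2.x with the TaskFlow API (the pipeline and task names are made up):

```python
# Minimal Airflow DAG sketch using the TaskFlow API (Airflow 2.x assumed).
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_pipeline():
    @task
    def extract() -> dict:
        # Pretend this pulls rows from an upstream source.
        return {"rows": 3}

    @task
    def load(payload: dict) -> None:
        print(f"loading {payload['rows']} rows")

    load(extract())


example_pipeline()
```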
Apache Superset is a Data Visualization and Data Exploration Platform
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
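A minimal sketch of a Prefect flow, assuming Prefect 2.x (the flow and task names are illustrative):

```python
# Minimal Prefect flow sketch (Prefect 2.x assumed).
from prefect import flow, task


@task
def fetch_numbers() -> list[int]:
    return [1, 2, 3]


@task
def total(numbers: list[int]) -> int:
    return sum(numbers)


@flow
def etl_flow() -> None:
    numbers = fetch_numbers()
    print(total(numbers))


if __name__ == "__main__":
    etl_flow()
```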
Turns Data and AI algorithms into production-ready web applications in no time.
Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
🧙 Build, run, and manage data pipelines for integrating and transforming data.
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming ...
An orchestration platform for the development, production, and observation of data assets.
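A minimal sketch using Dagster's software-defined assets, assuming a recent Dagster 1.x release (the asset names are illustrative):

```python
# Minimal Dagster software-defined assets sketch (Dagster 1.x assumed).
from dagster import asset, materialize


@asset
def raw_numbers() -> list[int]:
    return [1, 2, 3]


@asset
def summed(raw_numbers: list[int]) -> int:
    # Depends on raw_numbers via the parameter name.
    return sum(raw_numbers)


if __name__ == "__main__":
    # Materialize both assets in dependency order.
    materialize([raw_numbers, summed])
```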
Learn how to design, develop, deploy and iterate on production-grade ML applications.
Always know what to expect from your data.
Distributed DataFrame for Python designed for the cloud, powered by Rust
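A minimal sketch of the Daft DataFrame API, assuming the getdaft package is installed (the column names are made up):

```python
# Minimal Daft DataFrame sketch (the getdaft package is assumed).
import daft

df = daft.from_pydict({"name": ["a", "b", "c"], "value": [1, 2, 3]})
df = df.where(df["value"] > 1)  # lazy filter expression
df.show()                       # triggers execution and prints rows
```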
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
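A minimal sketch of a dlt pipeline loading Python dicts into DuckDB, assuming dlt with the duckdb destination is installed (pipeline, dataset, and table names are illustrative):

```python
# Minimal dlt pipeline sketch loading Python dicts into DuckDB
# (dlt with the duckdb destination is assumed; names are illustrative).
import dlt

rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

pipeline = dlt.pipeline(
    pipeline_name="example_pipeline",
    destination="duckdb",
    dataset_name="raw",
)
load_info = pipeline.run(rows, table_name="users")
print(load_info)
```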
An Awesome List of Open-Source Data Engineering Projects
Business intelligence as code: build fast, interactive data visualizations in pure SQL and markdown
PySpark Tutorial for Beginners - Practical Examples in Jupyter Notebook with Spark version 3.4.1. The tutorial covers various topics like Spark Introduction, Spark Installation, Spark RDD Transformati...
Code for blog at https://www.startdataengineering.com/post/python-for-de/
Pipeline that extracts data from Crinacle's Headphone and InEarMonitor databases and finalizes data for a Metabase Dashboard.
Code for "Efficient Data Processing in Spark" Course
VectorFlow is a high-volume vector embedding pipeline that ingests raw data, transforms it into vectors, and writes it to a vector DB of your choice.
Implementing best practices for PySpark ETL jobs and applications.
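To illustrate one such practice, keeping extract, transform, and load as small, separately testable steps, here is a minimal PySpark sketch assuming a local Spark session (paths and column names are hypothetical):

```python
# Minimal PySpark ETL sketch (local Spark session assumed; paths and columns are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

events = spark.read.json("input/events.json")
cleaned = (
    events
    .filter(F.col("user_id").isNotNull())
    .withColumn("event_date", F.to_date("event_ts"))
)
cleaned.write.mode("overwrite").parquet("output/events")
```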
Yet another repository with basic concepts, technical challenges, and resources on data engineering, in Spanish 🧙✨
Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.
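A minimal sketch of a Hamilton dataflow, assuming the sf-hamilton package; in Hamilton, each function is a node and its parameter names declare its upstream dependencies (the node names here are made up):

```python
# Minimal Hamilton dataflow sketch (the sf-hamilton package is assumed).
from hamilton import driver
from hamilton.ad_hoc_utils import create_temporary_module


def raw_value() -> int:
    return 10


def doubled(raw_value: int) -> int:
    # Depends on raw_value via the parameter name.
    return raw_value * 2


temp_module = create_temporary_module(raw_value, doubled)
dr = driver.Driver({}, temp_module)
print(dr.execute(["doubled"]))  # {'doubled': 20}
```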
Datart is a next generation Data Visualization Open Platform
This repository showcases my Data Analytics & Engineering skills, shares projects, and tracks my progress.
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
My Digital Palace - A Personal Journal for Reflection - A place to store all my thoughts
A portable Datamart and Business Intelligence suite built with Docker, Dagster, dbt, DuckDB, PostgreSQL and Superset
Supplementary Materials for the The Complete dbt (Data Build Tool) Bootcamp Udemy course
Titan Core - Snowflake infrastructure-as-code. Provision environments, automate deploys, CI/CD. Manage RBAC, users, roles, and data access. Declarative Python Resource API. Change Management tool for ...
Code for "Efficient Data Processing in Spark" Course
A curated list of open source tools used in analytical stacks and data engineering ecosystem
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
A DuckDB-powered command line interface for Snowflake security, governance, operations, and cost optimization.
DataOps Data Quality TestGen is part of DataKitchen's Open Source Data Observability. DataOps TestGen delivers simple, fast data quality test generation and execution by data profiling, new dataset...
Collection of Snowflake Notebook demos, tutorials, and examples
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility acr...
Declarative, text-based tool for data analysts and engineers to extract, load, transform, and orchestrate their data pipelines.
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
The open-source, standalone, full-stack .NET job orchestrator that we've been missing.
End-to-end data engineering project with Kafka, Airflow, Spark, Postgres, and Docker.
UnitGen is a code fine-tuning data framework that generates fine-tuning data directly from your existing codebase: code completion, test generation, documentation generation, and more.
A portable Datamart and Business Intelligence suite built with Docker, Mage, dbt, DuckDB and Superset
🎨 UI for the Free Data Engineering Zoomcamp Course provided by DataTalksClub
Jayvee is a domain-specific language and runtime for automated processing of data pipelines
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also allows running data cleaning scenarios using these algor...
A Python package that creates fine-grained dbt tasks on Apache Airflow
A Streamlit app to explore data engineering salary data.
This documentation gives a quick snapshot of my project in the data field, showcasing my skills and know-how in this area.
Dagster Labs' open-source data platform, built with Dagster.
Orbital automates integration between data sources (APIs, databases, queues, and functions). BFFs, API composition, and ETL pipelines that adapt as your specs change.
A software engineering framework to jump start your machine learning projects
Sample project to demonstrate data engineering best practices
breadroll 🥟 is a simple, lightweight library for data processing operations, written in TypeScript and powered by Bun.