Trending repositories for topic data-engineering
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
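The core idea behind orchestrators like Airflow is a DAG of tasks executed in dependency order. A minimal stdlib sketch of that idea (this is not Airflow's API; the three task names are made up for illustration):

```python
from graphlib import TopologicalSorter

# Hypothetical three-task pipeline: extract -> transform -> load.
# Only a sketch of DAG scheduling, not Airflow's actual API.
dag = {
    "transform": {"extract"},   # transform depends on extract
    "load": {"transform"},      # load depends on transform
}

# static_order() yields tasks so every dependency runs first.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load']
```

A real orchestrator adds scheduling, retries, and state tracking on top of this ordering.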
Apache Superset is a Data Visualization and Data Exploration Platform
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
Turns Data and AI algorithms into production-ready web applications in no time.
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Business intelligence as code: build fast, interactive data visualizations in pure SQL and markdown
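"BI as code" means the query behind a chart is versioned plain SQL. A self-contained sketch with sqlite3 (table and column names are invented for illustration, not tied to any particular BI tool):

```python
import sqlite3

# The SQL that would drive a chart lives as code, not in a GUI.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("EMEA", 120.0), ("EMEA", 80.0), ("APAC", 50.0)],
)

rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM orders "
    "GROUP BY region ORDER BY total DESC"
).fetchall()
print(rows)  # [('EMEA', 200.0), ('APAC', 50.0)]
```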
An orchestration platform for the development, production, and observation of data assets.
🧙 Build, run, and manage data pipelines for integrating and transforming data.
DataFlow Engine is a low-level data-driven engine for data integration, synchronization, exchange, and sharing, plus task configuration and scheduling. It uses separation of management and execution, multiple flow layers, and a plugin library to handle large-scale data tasks, high-frequency data reporting and collection, and heterogeneous-data compatibility.
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming ...
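Continuous analytics of this kind typically means windowed aggregation over an event stream. A stdlib sketch of a tumbling-window count (event shape and window size are illustrative assumptions, not any engine's API):

```python
from collections import defaultdict

def tumbling_counts(events, window=10):
    """Count events per (window_start, key) over 10-second tumbling windows.

    events: iterable of (timestamp_seconds, key) pairs.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Bucket each event into the window that contains its timestamp.
        counts[(ts // window * window, key)] += 1
    return dict(counts)

events = [(1, "click"), (4, "click"), (12, "view"), (13, "click")]
print(tumbling_counts(events))
# {(0, 'click'): 2, (10, 'view'): 1, (10, 'click'): 1}
```

A streaming engine maintains these counts incrementally as events arrive instead of batching them.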
Always know what to expect from your data.
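"Knowing what to expect from your data" means declaring checks that each record must satisfy. A toy expectations check in plain Python, inspired by the concept (this is not the Great Expectations API; columns and rules are made up):

```python
# Declare what each column should look like, then validate rows.
expectations = {
    "id": lambda v: isinstance(v, int) and v > 0,
    "email": lambda v: isinstance(v, str) and "@" in v,
}

def validate(rows):
    """Return (row_index, column) for every failed expectation."""
    failures = []
    for i, row in enumerate(rows):
        for col, check in expectations.items():
            if not check(row.get(col)):
                failures.append((i, col))
    return failures

rows = [{"id": 1, "email": "a@b.com"}, {"id": -2, "email": "nope"}]
print(validate(rows))  # [(1, 'id'), (1, 'email')]
```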
Distributed data engine for Python/SQL designed for the cloud, powered by Rust

🥪🦘 An open source sandbox project exploring dbt workflows via a fictional sandwich shop's data.
A curated list of awesome blogs, videos, tools and resources about Data Contracts
A collection of Airflow operators, hooks, and utilities to elevate dbt to a first-class citizen of Airflow.
Coursera Specialization: Machine Learning and Data Analysis (Yandex & MIPT)
Pipeline that extracts data from Crinacle's Headphone and InEarMonitor databases and finalizes data for a Metabase Dashboard.
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
The Lakehouse Engine is a configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Prod...
The best place to learn data engineering. Built and maintained by the data engineering community.
Code for "Efficient Data Processing in Spark" Course
Practical Data Engineering: A Hands-On Real-Estate Project Guide
A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
lakeFS - Data version control for your data lake | Git for data
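The "Git for data" idea rests on content addressing: a commit is identified by the hash of its contents, so identical data yields an identical id. A stdlib sketch of that core mechanism (illustrative only, not lakeFS's model or API):

```python
import hashlib
import json

def commit_id(files: dict) -> str:
    """Derive a short content-addressed id from a {path: contents} snapshot."""
    blob = json.dumps(files, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

v1 = commit_id({"users.csv": "id,name\n1,ada"})
v2 = commit_id({"users.csv": "id,name\n1,ada"})          # same data, same id
v3 = commit_id({"users.csv": "id,name\n1,ada\n2,bob"})   # changed data, new id
print(v1 == v2, v1 == v3)  # True False
```

Branching and diffing then become pointer operations over these ids rather than copies of the data.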
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
Collection of Snowflake Notebook demos, tutorials, and examples
Declarative, text-based tool for data analysts and engineers to extract, load, transform, and orchestrate their data pipelines.
PySpark Tutorial for Beginners - Practical Examples in Jupyter Notebook with Spark version 3.4.1. The tutorial covers various topics like Spark Introduction, Spark Installation, Spark RDD Transformati...
Toolbox for building Generative AI applications on top of Apache Spark.
A curated list of open source tools used in analytics platforms and the data engineering ecosystem
Construct a modern data stack and orchestrate the workflows to create high-quality data for analytics and ML applications.
Dataplane is an Airflow-inspired unified data platform with additional data mesh and RPA capability to automate, schedule, and design data pipelines and workflows. Dataplane is written in Golang with a...
🔥🔥🔥 Open Source Alternative to Hightouch, Census, and RudderStack - Reverse ETL & Data Activation
Materials for the Deploy and Monitor ML Pipelines with Python, Docker and GitHub Actions workshop at the PyData NYC 2024 conference
Data Engineering Project with Hadoop HDFS and Kafka
This repo contains "Databricks Certified Data Engineer Professional" Questions and related docs.
This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.
A robust (🐢) and fast (🐇) MLOps tool for managing data and pipelines in Rust (🦀)
This repo contains "Databricks Certified Data Engineer Associate" Questions and related docs.
A Python package that creates fine-grained dbt tasks on Apache Airflow
Jayvee is a domain-specific language and runtime for automated processing of data pipelines
Yet another repository with basic concepts, technical challenges, and resources on data engineering, in Spanish 🧙✨
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
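Building "complex pipelines from simple, reusable components" usually means composing small functions in sequence. A stdlib sketch of the pattern (the step names are invented for illustration, not any framework's API):

```python
from functools import reduce

# Two small, reusable steps...
def strip_whitespace(rows):
    return [r.strip() for r in rows]

def drop_empty(rows):
    return [r for r in rows if r]

# ...and a pipeline that is just an ordered list of steps.
def pipeline(steps, data):
    return reduce(lambda acc, step: step(acc), steps, data)

result = pipeline([strip_whitespace, drop_empty], ["  a ", "", " b"])
print(result)  # ['a', 'b']
```

Because each step is an ordinary function, steps can be unit-tested and recombined independently.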
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility acr...
End to end data engineering project with kafka, airflow, spark, postgres and docker.
Code for blog at https://www.startdataengineering.com/post/python-for-de/
UnitGen is a code fine-tuning data framework that generates fine-tuning data directly from your existing codebase: code completion, test generation, documentation generation, and more.
A portable Datamart and Business Intelligence suite built with Docker, Mage, dbt, DuckDB and Superset
DataOps Data Quality TestGen is part of DataKitchen's Open Source Data Observability. DataOps TestGen delivers simple, fast data quality test generation and execution by data profiling, new dataset...
DataOps Observability is part of DataKitchen's Open Source Data Observability. DataOps Observability monitors every data journey from data source to customer value, from any team development environm...
A DuckDB-powered command line interface for Snowflake security, governance, operations, and cost optimization.
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also allows running data-cleaning scenarios using these algorithms.
breadroll 🥟 is a simple, lightweight library for data processing operations, written in TypeScript and powered by Bun.
Titan Core - Snowflake infrastructure-as-code. Provision environments, automate deploys, CI/CD. Manage RBAC, users, roles, and data access. Declarative Python Resource API. Change Management tool for ...
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
A portable Datamart and Business Intelligence suite built with Docker, Dagster, dbt, DuckDB, PostgreSQL and Superset