Trending repositories for topic data-engineering
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Turns Data and AI algorithms into production-ready web applications in no time.
Apache Superset is a Data Visualization and Data Exploration Platform
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
An orchestration platform for the development, production, and observation of data assets.
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming ...
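Data Flow Engine is a low-level data-driven engine for data integration, data synchronization, data exchange, data sharing, task configuration, and task scheduling. It uses an architecture that separates management from execution, with multiple flow layers and a plugin library, to handle practical data problems such as large-scale data tasks, high-frequency data reporting, high-frequency data collection, and heterogeneous data compatibility.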
Always know what to expect from your data.
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Learn how to design, develop, deploy and iterate on production-grade ML applications.
Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.
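Data Flow Engine is a low-level data-driven engine for data integration, data synchronization, data exchange, data sharing, task configuration, and task scheduling. It uses an architecture that separates management from execution, with multiple flow layers and a plugin library, to handle practical data problems such as large-scale data tasks, high-frequency data reporting, high-frequency data collection, and heterogeneous data compatibility.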
A curated list of open source tools used in analytics platforms and data engineering ecosystem
Turns Data and AI algorithms into production-ready web applications in no time.
Home of the Open Data Contract Standard (ODCS).
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Collection of Snowflake Notebook demos, tutorials, and examples
Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.
Code for "Efficient Data Processing in Spark" Course
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates t...
📙 Awesome Data Catalogs and Observability Platforms.
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming ...
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Turns Data and AI algorithms into production-ready web applications in no time.
Apache Superset is a Data Visualization and Data Exploration Platform
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
An orchestration platform for the development, production, and observation of data assets.
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
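Data Flow Engine is a low-level data-driven engine for data integration, data synchronization, data exchange, data sharing, task configuration, and task scheduling. It uses an architecture that separates management from execution, with multiple flow layers and a plugin library, to handle practical data problems such as large-scale data tasks, high-frequency data reporting, high-frequency data collection, and heterogeneous data compatibility.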
Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
Learn how to design, develop, deploy and iterate on production-grade ML applications.
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming ...
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
Data Engineering Project with Hadoop HDFS and Kafka
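Data Flow Engine is a low-level data-driven engine for data integration, data synchronization, data exchange, data sharing, task configuration, and task scheduling. It uses an architecture that separates management from execution, with multiple flow layers and a plugin library, to handle practical data problems such as large-scale data tasks, high-frequency data reporting, high-frequency data collection, and heterogeneous data compatibility.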
This repo contains "Databricks Certified Data Engineer Professional" Questions and related docs.
Pipeline that extracts data from Crinacle's Headphone and InEarMonitor databases and finalizes data for a Metabase Dashboard. The dashboard is then used to support a purchasing decision of which Headp...
Declarative, text-based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.
This repo contains "Databricks Certified Data Engineer Associate" Questions and related docs.
A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
PySpark Tutorial for Beginners - Practical Examples in Jupyter Notebook with Spark version 3.4.1. The tutorial covers various topics like Spark Introduction, Spark Installation, Spark RDD Transformati...
A portable Datamart and Business Intelligence suite built with Docker, Dagster, dbt, DuckDB and Superset
git push your data stack with Airbyte, Airflow, and dbt - 2022 Airflow Summit
A curated list of open source tools used in analytics platforms and data engineering ecosystem
Collection of Snowflake Notebook demos, tutorials, and examples
🥪🦘 An open source sandbox project exploring dbt workflows via a fictional sandwich shop's data.
End-to-end data engineering project with Kafka, Airflow, Spark, Postgres and Docker.
Home of the Open Data Contract Standard (ODCS).
Code for "Efficient Data Processing in Spark" Course
Turns Data and AI algorithms into production-ready web applications in no time.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Apache Superset is a Data Visualization and Data Exploration Platform
An orchestration platform for the development, production, and observation of data assets.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Learn how to design, develop, deploy and iterate on production-grade ML applications.
Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
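Data Flow Engine is a low-level data-driven engine for data integration, data synchronization, data exchange, data sharing, task configuration, and task scheduling. It uses an architecture that separates management from execution, with multiple flow layers and a plugin library, to handle practical data problems such as large-scale data tasks, high-frequency data reporting, high-frequency data collection, and heterogeneous data compatibility.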
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming ...
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Always know what to expect from your data.
Arcane Insight is a data analytics project designed to harness the power of SQLMesh & DuckDB to collect, transform, and analyze data from Blizzard’s Hearthstone API. Focused on card statistics and att...
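Data Flow Engine is a low-level data-driven engine for data integration, data synchronization, data exchange, data sharing, task configuration, and task scheduling. It uses an architecture that separates management from execution, with multiple flow layers and a plugin library, to handle practical data problems such as large-scale data tasks, high-frequency data reporting, high-frequency data collection, and heterogeneous data compatibility.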
This repo contains "Databricks Certified Data Engineer Professional" Questions and related docs.
Data Engineering Project with Hadoop HDFS and Kafka
Turns Data and AI algorithms into production-ready web applications in no time.
A curated list of open source tools used in analytics platforms and data engineering ecosystem
A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility acr...
End-to-end data engineering project with Kafka, Airflow, Spark, Postgres and Docker.
Declarative, text-based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.
A portable Datamart and Business Intelligence suite built with Docker, Dagster, dbt, DuckDB and Superset
OpenSnowcat Collector, an open source fork of Snowplow (Apache 2.0 License)
🥪🦘 An open source sandbox project exploring dbt workflows via a fictional sandwich shop's data.
Collection of Snowflake Notebook demos, tutorials, and examples
DataOps Data Quality TestGen is part of DataKitchen's Open Source Data Observability. DataOps TestGen delivers simple, fast data quality test generation and execution by data profiling, new dataset...
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
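Data Flow Engine is a low-level data-driven engine for data integration, data synchronization, data exchange, data sharing, task configuration, and task scheduling. It uses an architecture that separates management from execution, with multiple flow layers and a plugin library, to handle practical data problems such as large-scale data tasks, high-frequency data reporting, high-frequency data collection, and heterogeneous data compatibility.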
Collection of Snowflake Notebook demos, tutorials, and examples
A curated list of open source tools used in analytics platforms and data engineering ecosystem
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility acr...
Materials for the Deploy and Monitor ML Pipelines with Python, Docker and GitHub Actions workshop at the PyData NYC 2024 conference
Code for blog at https://www.startdataengineering.com/post/python-for-de/
A portable Datamart and Business Intelligence suite built with Docker, Mage, dbt, DuckDB and Superset
DataOps Data Quality TestGen is part of DataKitchen's Open Source Data Observability. DataOps TestGen delivers simple, fast data quality test generation and execution by data profiling, new dataset...
DataOps Observability is part of DataKitchen's Open Source Data Observability. DataOps Observability monitors every data journey from data source to customer value, from any team development environm...
A DuckDB-powered command line interface for Snowflake security, governance, operations, and cost optimization.
Arcane Insight is a data analytics project designed to harness the power of SQLMesh & DuckDB to collect, transform, and analyze data from Blizzard’s Hearthstone API. Focused on card statistics and att...
OpenSnowcat Collector, an open source fork of Snowplow (Apache 2.0 License)
Turns Data and AI algorithms into production-ready web applications in no time.
Apache Superset is a Data Visualization and Data Exploration Platform
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Learn how to design, develop, deploy and iterate on production-grade ML applications.
An orchestration platform for the development, production, and observation of data assets.
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
🧙 Build, run, and manage data pipelines for integrating and transforming data.
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming ...
🔥🔥🔥 Open Source Alternative to Hightouch, Census, and RudderStack - Reverse ETL & Data Activation
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
Code for "Efficient Data Processing in Spark" Course
The data-validation toolkit for enhanced dbt (data build tool) PR review
A curated list of open source tools used in analytics platforms and data engineering ecosystem
PySpark Tutorial for Beginners - Practical Examples in Jupyter Notebook with Spark version 3.4.1. The tutorial covers various topics like Spark Introduction, Spark Installation, Spark RDD Transformati...
🥪🦘 An open source sandbox project exploring dbt workflows via a fictional sandwich shop's data.
Jayvee is a domain-specific language and runtime for automated processing of data pipelines
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
UnitGen is a code fine-tuning data framework that generates fine-tuning data directly from your existing codebase: code completion, test generation, documentation generation, and more.
This repository is to show my Data Analytics & Engineering skills, share projects, and track my progress.
Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It also allows to run data cleaning scenarios using these algor...
Turns Data and AI algorithms into production-ready web applications in no time.
Example Fondant pipeline preparing data to train a Controlnet model
This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.
Data Engineering Project with Hadoop HDFS and Kafka