Trending repositories for topic data-engineering
Real-time data transformation framework for AI. Ultra performant, with incremental processing.
Learn to build your Second Brain AI assistant with LLMs, agents, RAG, fine-tuning, LLMOps and AI systems techniques.
Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Apache Superset is a Data Visualization and Data Exploration Platform
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
An orchestration platform for the development, production, and observation of data assets.
Learn how to design, develop, deploy and iterate on production-grade ML applications.
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
A list of useful resources to learn Data Engineering from scratch
Turns Data and AI algorithms into production-ready web applications in no time.
Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
Real-time data transformation framework for AI. Ultra performant, with incremental processing.
A declarative PySpark framework for row- and aggregate-level data quality validation.
Learn to build your Second Brain AI assistant with LLMs, agents, RAG, fine-tuning, LLMOps and AI systems techniques.
The most comprehensive SQL guide from a real-world expert! Learn everything from basics to advanced queries, optimizations, and real-world SQL
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
A curated list of open source tools used in analytics platforms and data engineering ecosystem
The data-validation toolkit for enhanced dbt (data build tool) PR review
An MCP (Model Context Protocol) server for interacting with dbt.
Data Engineering Project with Hadoop HDFS and Kafka
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Dagster Labs' open-source data platform, built with Dagster.
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
Supplementary Materials for The Complete dbt (Data Build Tool) Bootcamp Udemy course
Code for "Efficient Data Processing in Spark" Course
A list of useful resources to learn Data Engineering from scratch
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also allows running data cleaning scenarios using these algor...
Home of the Open Data Contract Standard (ODCS).
Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
Real-time data transformation framework for AI. Ultra performant, with incremental processing.
Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.
Apache Superset is a Data Visualization and Data Exploration Platform
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Learn to build your Second Brain AI assistant with LLMs, agents, RAG, fine-tuning, LLMOps and AI systems techniques.
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
An orchestration platform for the development, production, and observation of data assets.
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Learn how to design, develop, deploy and iterate on production-grade ML applications.
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
Turns Data and AI algorithms into production-ready web applications in no time.
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
A declarative PySpark framework for row- and aggregate-level data quality validation.
Real-time data transformation framework for AI. Ultra performant, with incremental processing.
The most comprehensive SQL guide from a real-world expert! Learn everything from basics to advanced queries, optimizations, and real-world SQL
Learn to build your Second Brain AI assistant with LLMs, agents, RAG, fine-tuning, LLMOps and AI systems techniques.
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
An MCP (Model Context Protocol) server for interacting with dbt.
PDF DataSource for Apache Spark that reads PDF files directly into a DataFrame and runs OCR on them
Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
The data-validation toolkit for enhanced dbt (data build tool) PR review
A curated list of open source tools used in analytics platforms and data engineering ecosystem
Dataform Tools - VS Code extension to run and visualise Dataform data pipelines and much more
Data Engineering Project with Hadoop HDFS and Kafka
Dagster Labs' open-source data platform, built with Dagster.
🥪🦘 An open source sandbox project exploring dbt workflows via a fictional sandwich shop's data.
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
Home of the Open Data Contract Standard (ODCS).
Interactive Python TUI for visualizing and analyzing files with multiple formats
A declarative PySpark framework for row- and aggregate-level data quality validation.
Real-time data transformation framework for AI. Ultra performant, with incremental processing.
Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.
Apache Superset is a Data Visualization and Data Exploration Platform
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Learn to build your Second Brain AI assistant with LLMs, agents, RAG, fine-tuning, LLMOps and AI systems techniques.
An orchestration platform for the development, production, and observation of data assets.
An MCP (Model Context Protocol) server for interacting with dbt.
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
Turns Data and AI algorithms into production-ready web applications in no time.
Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
An MCP (Model Context Protocol) server for interacting with dbt.
Real-time data transformation framework for AI. Ultra performant, with incremental processing.
Interactive Python TUI for visualizing and analyzing files with multiple formats
The most comprehensive SQL guide from a real-world expert! Learn everything from basics to advanced queries, optimizations, and real-world SQL
ELT Data Pipeline implementation in Data Warehousing environment
PDF DataSource for Apache Spark that reads PDF files directly into a DataFrame and runs OCR on them
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
Learn to build your Second Brain AI assistant with LLMs, agents, RAG, fine-tuning, LLMOps and AI systems techniques.
This repository contains a collection of SQL scripts demonstrating various analytical techniques, such as change-over-time, cumulative, performance, data segmentation, and part-to-whole analysis.
Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.
Cortex | Data Framework—a cutting-edge SDK that simplifies real-time data processing with intuitive operators, robust state management, and seamless telemetry for efficient, scalable pipelines.
🧠Mindmap of 🗺️Software Architecture, Software engineering: An Overview of Software Terminologies and Concepts.
Slipstream provides a data-flow model to simplify development of stateful streaming applications.
Conduit streams data between data stores. Kafka Connect replacement. No JVM required.
A data integration platform based on Pub/Sub for Tables, to discover, publish, modify and consume data effortlessly.
A curated list of open source tools used in analytics platforms and data engineering ecosystem
The data-validation toolkit for enhanced dbt (data build tool) PR review
Real-time data transformation framework for AI. Ultra performant, with incremental processing.
Learn to build your Second Brain AI assistant with LLMs, agents, RAG, fine-tuning, LLMOps and AI systems techniques.
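The data flow engine is a low-level, data-driven engine for data integration, data synchronization, data exchange, data sharing, task configuration, and task scheduling. It uses an architecture of separated management and execution, multiple stream layers, and a plugin library to handle practical data problems: large-scale data tasks, high-frequency data reporting, high-frequency data collection, and compatibility across heterogeneous data.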
A comprehensive guide to building a modern data warehouse with SQL Server, including ETL processes, data modeling, and analytics.
Modern serverless lakehouse implementing HOOK methodology, Unified Star Schema (USS), and Analytical Data Storage System (ADSS) principles on Adventure Works. Features programmatic model generation, e...
The most comprehensive SQL guide from a real-world expert! Learn everything from basics to advanced queries, optimizations, and real-world SQL
Materials for the Deploy and Monitor ML Pipelines with Python, Docker and GitHub Actions workshop at the PyData NYC 2024 conference
Elusion is a high-performance DataFrame / Data Engineering / Data Analytics library for managing and querying data using a DataFrame-like interface.
Code for blog at https://www.startdataengineering.com/post/python-for-de/
PDF DataSource for Apache Spark that reads PDF files directly into a DataFrame and runs OCR on them
RushDB is an instant database for modern apps and DS/ML ops built on top of Neo4j
Dataform Tools - VS Code extension to run and visualise Dataform data pipelines and much more
Interactive Python TUI for visualizing and analyzing files with multiple formats
This repository contains a collection of SQL scripts demonstrating various analytical techniques, such as change-over-time, cumulative, performance, data segmentation, and part-to-whole analysis.
📡 Real-time data pipeline with Kafka, Flink, Iceberg, Trino, MinIO, and Superset. Ideal for learning data systems.
Turns Data and AI algorithms into production-ready web applications in no time.
Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.
Apache Superset is a Data Visualization and Data Exploration Platform
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
An orchestration platform for the development, production, and observation of data assets.
Learn how to design, develop, deploy and iterate on production-grade ML applications.
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
Real-time data transformation framework for AI. Ultra performant, with incremental processing.
Learn to build your Second Brain AI assistant with LLMs, agents, RAG, fine-tuning, LLMOps and AI systems techniques.
🧙 Build, run, and manage data pipelines for integrating and transforming data.
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
An Awesome List of Open-Source Data Engineering Projects
Collection of Snowflake Notebook demos, tutorials, and examples
This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.
Data Engineering Project with Hadoop HDFS and Kafka
Code for "Efficient Data Processing in Spark" Course
A data-driven approach to predicting football match outcomes using advanced machine learning techniques. This project integrates various algorithms to forecast game results, providing insights for spo...
Never sift through endless dbt™ logs again. dbt Command Center is a free, open-source, local web application that provides a user-friendly interface to monitor and manage dbt runs.
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility acr...
A curated list of open source tools used in analytics platforms and data engineering ecosystem
A curated collection of AI, data engineering, and DevOps projects featuring real-world applications, advanced techniques, and tutorials—ideal for learners and practitioners exploring data science and ...
DataOps Data Quality TestGen is part of DataKitchen's Open Source Data Observability. DataOps TestGen delivers simple, fast data quality test generation and execution by data profiling, new dataset...
This repo contains "Databricks Certified Data Engineer Professional" Questions and related docs.
🥪🦘 An open source sandbox project exploring dbt workflows via a fictional sandwich shop's data.
Open source data platform to build event-driven systems. It's like Debezium for golang :)
A configuration-driven framework for building Dagster pipelines that enables teams to create and manage data workflows using YAML/JSON instead of code
DataOps Observability Integration Agents are part of DataKitchen's Open Source Data Observability. They connect to various ETL, ELT, BI, data science, data visualization, data governance, and data ana...