Trending repositories for topic data-engineering
Prefect is a workflow orchestration tool empowering developers to build, observe, and react to data pipelines
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Apache Superset is a Data Visualization and Data Exploration Platform
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
Turns Data and AI algorithms into production-ready web applications in no time.
More than 2,000 data engineer interview questions.
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
🧙 Build, run, and manage data pipelines for integrating and transforming data.
An orchestration platform for the development, production, and observation of data assets.
Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage and metadata. Runs and scales everywhere python does.
An Awesome List of Open-Source Data Engineering Projects
Infinitely scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
Collection of Snowflake Notebook demos, tutorials, and examples
A curated list of open source tools used in analytical stacks and data engineering ecosystem
Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility acr...
Compute over Data framework for public, transparent, and optionally verifiable computation
Snowflake infrastructure-as-code. Provision environments, automate deploys, CI/CD. Manage RBAC, users, roles, and data access. Declarative Python Resource API. Change Management tool for the Snowflake...
A lightweight CLI tool for versioning data alongside source code and building data pipelines.
Implementing best practices for PySpark ETL jobs and applications.
Practical Data Engineering: A Hands-On Real-Estate Project Guide
A list of useful resources to learn Data Engineering from scratch
A machine learning project implemented from scratch, involving web scraping, data engineering, exploratory data analysis, and machine learning to predict housing prices in the New York Tri-State Area.
PySpark Tutorial for Beginners - Practical Examples in Jupyter Notebook with Spark version 3.4.1. The tutorial covers various topics like Spark Introduction, Spark Installation, Spark RDD Transformati...
Code for "Efficient Data Processing in Spark" Course
The best place to learn data engineering. Built and maintained by the data engineering community.
A DuckDB extension to read data directly from databases supporting the ODBC interface
Business intelligence as code: build fast, interactive data visualizations in pure SQL and markdown
DataPulse is a platform for developers to build, schedule and monitor data pipelines.
DataOps Observability is part of DataKitchen's Open Source Data Observability. DataOps Observability monitors every data journey from data source to customer value, from any team development environm...
End to end data engineering project with kafka, airflow, spark, postgres and docker.
🥪🦘 An open source sandbox project exploring dbt workflows via a fictional sandwich shop's data.
A Python package that creates fine-grained dbt tasks on Apache Airflow
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All compone...
Compilation of high-profile real-world examples of failed machine learning projects
This documentation is a quick snapshot of my project in the data field, showcasing my skills and know-how in the area.
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
🔥🔥🔥 Open Source Alternative to Hightouch, Census, and RudderStack - Reverse ETL & Customer Data Platform (CDP)
VectorFlow is a high volume vector embedding pipeline that ingests raw data, transforms it into vectors and writes it to a vector DB of your choice.
One more repository with basic concepts, technical challenges, and resources on data engineering, in Spanish 🧙✨
Sample project to demonstrate data engineering best practices
(WIP) Getting started with Docker - An introduction to Docker with data science and engineering applications
breadroll 🥟 is a simple lightweight library for data processing operations written in Typescript and powered by Bun.
SQL stream processing, analytics, and management. We decouple storage and compute to offer instant failover, dynamic scaling, speedy bootstrapping, and efficient joins.
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Distributed DataFrame for Python designed for the cloud, powered by Rust
🎨 UI for the Free Data Engineering Zoomcamp Course provided by DataTalksClub
Jayvee is a domain-specific language and runtime for automated processing of data pipelines
Cookbook to provide solutions to common tasks and problems in using Polars with R
Snowflake infrastructure-as-code. Provision environments, automate deploys, CI/CD. Manage RBAC, users, roles, and data access. Declarative Python Resource API. Change Management tool for the Snowflake...
Prism is the easiest way to develop, orchestrate, and execute data pipelines in Python.
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also allows running data cleaning scenarios using these algor...
The REST API and execution engine for the Didact Platform.
Service for bulk-loading data to databases with automatic schema management (Redshift, Snowflake, BigQuery, ClickHouse, Postgres, MySQL)
Orbital automates integration between data sources (APIs, Databases, Queues and Functions). BFF's, API Composition and ETL pipelines that adapt as your specs change.
A software engineering framework to jump start your machine learning projects
A data science boot camp that aims to make the field of data science accessible and understandable to a wide range of individuals, regardless of their background or expertise.
My Digital Palace - A Personal Journal for Reflection - A place to store all my thoughts