37 results found

370
4.6k
apache-2.0
42
lakeFS - Data version control for your data lake | Git for data
Created 2019-09-12
5,727 commits to master branch, last one 2 days ago
243
3.4k
apache-2.0
22
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Created 2022-01-26
3,549 commits to devel branch, last one 4 days ago
936
2.2k
apache-2.0
63
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Created 2017-12-18
4,265 commits to master branch, last one a day ago
331
1.7k
apache-2.0
62
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every d...
Created 2022-09-29
236 commits to master branch, last one about a year ago
A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
Created 2020-01-20
80 commits to master branch, last one 5 years ago
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Created 2020-02-13
50 commits to master branch, last one 5 years ago
577
1.1k
apache-2.0
112
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is license...
Created 2017-02-08
8,313 commits to master branch, last one 6 years ago
Personal Data Engineering Projects
Created 2020-04-20
65 commits to master branch, last one 2 years ago
32
671
apache-2.0
15
Data API Framework for AI Agents and Data Apps
Created 2022-04-27
1,056 commits to develop branch, last one 11 months ago
35
523
apache-2.0
6
Lakekeeper is an Apache-Licensed, secure, fast and easy to use Apache Iceberg REST Catalog written in Rust.
Created 2024-04-05
672 commits to main branch, last one 23 hours ago
111
478
other
27
Generic Data Ingestion & Dispersal Library for Hadoop
This repository has been archived
Created 2018-01-05
33 commits to master branch, last one 5 years ago
Enterprise-grade, production-hardened, serverless data lake on AWS
Created 2020-09-08
647 commits to main branch, last one 2 days ago
28
285
apache-2.0
11
Use SQL to build ELT pipelines on a data lakehouse.
Created 2021-03-11
481 commits to main branch, last one 2 years ago
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Created 2020-02-07
723 commits to master branch, last one 25 days ago
BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)
Created 2023-05-22
7 commits to master branch, last one 10 months ago
679
234
mit
110
U-SQL Examples and Issue Tracking
Created 2015-10-13
253 commits to master branch, last one 2 years ago
60
226
apache-2.0
14
🤖 The Semantic Engine for Model Context Protocol (MCP) Clients and AI Agents 🔥
Created 2022-05-09
898 commits to main branch, last one 3 days ago
Resources for video demonstrations and blog posts related to DataOps on AWS
Created 2021-11-07
107 commits to main branch, last one 3 years ago
45
143
agpl-3.0
4
An efficient storage and compute engine for both on-prem and cloud-native data analytics.
Created 2019-06-21
1,439 commits to master branch, last one 13 days ago
105
139
mit
107
Samples and Docs for Azure Data Lake Store and Analytics
Created 2015-04-28
861 commits to master branch, last one 2 years ago
Apache Spark 3 - Structured Streaming Course Material
Created 2020-07-21
29 commits to master branch, last one 4 years ago
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Created 2019-08-07
2,046 commits to develop-spark3 branch, last one 5 days ago
Cloudflare R2 bucket File Uploader with multipart upload enabled. Tested with files up to 10 GB size. Demo example for NextJS.
Created 2023-09-13
24 commits to main branch, last one 7 months ago
10
103
other
1
A Git-like Version Control File System for AI & Data Product Management.
Created 2023-11-24
298 commits to main branch, last one 3 months ago
Apache Spark Course Material
Created 2020-05-05
34 commits to master branch, last one 4 years ago
Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.
Created 2023-04-04
15 commits to main branch, last one about a year ago
GraphQL API for Zeebe data
Created 2020-02-03
781 commits to main branch, last one 5 months ago
10
61
apache-2.0
29
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
This repository has been archived
Created 2018-01-29
135 commits to master branch, last one 6 months ago
2
56
other
10
The DBT of ML: Aligned describes data dependencies in ML systems and reduces technical data debt
Created 2022-04-27
443 commits to main branch, last one 6 days ago
28
56
apache-2.0
2
Web UI for Amazon Athena
Created 2020-12-30
42 commits to master branch, last one 2 years ago