36 results found

352
4.4k
apache-2.0
43
lakeFS - Data version control for your data lake | Git for data
Created 2019-09-12
5,459 commits to master branch, last one 18 hours ago
172
2.6k
apache-2.0
19
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Created 2022-01-26
3,374 commits to devel branch, last one 12 hours ago
916
2.1k
apache-2.0
62
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Created 2017-12-18
4,156 commits to master branch, last one 2 days ago
335
1.6k
apache-2.0
62
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every d...
Created 2022-09-29
236 commits to master branch, last one 10 months ago
A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
Created 2020-01-20
80 commits to master branch, last one 4 years ago
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Created 2020-02-13
50 commits to master branch, last one 4 years ago
580
1.1k
apache-2.0
114
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is license...
Created 2017-02-08
8,313 commits to master branch, last one 5 years ago
Personal Data Engineering Projects
Created 2020-04-20
65 commits to master branch, last one 2 years ago
28
640
apache-2.0
13
Data API Framework for AI Agents and Data Apps
Created 2022-04-27
1,056 commits to develop branch, last one 6 months ago
111
479
other
28
Generic Data Ingestion & Dispersal Library for Hadoop
This repository has been archived
Created 2018-01-05
33 commits to master branch, last one 5 years ago
Enterprise-grade, production-hardened, serverless data lake on AWS
Created 2020-09-08
511 commits to main branch, last one 8 days ago
28
285
apache-2.0
12
Use SQL to build ELT pipelines on a data lakehouse.
Created 2021-03-11
481 commits to main branch, last one 2 years ago
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Created 2020-02-07
720 commits to master branch, last one about a month ago
683
234
mit
112
U-SQL Examples and Issue Tracking
Created 2015-10-13
253 commits to master branch, last one about a year ago
BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)
Created 2023-05-22
7 commits to master branch, last one 6 months ago
14
214
apache-2.0
3
Lakekeeper: A Rust native Iceberg REST Catalog
Created 2024-04-05
307 commits to main branch, last one 10 hours ago
Resources for video demonstrations and blog posts related to DataOps on AWS
Created 2021-11-07
107 commits to main branch, last one 2 years ago
105
139
mit
108
Samples and Docs for Azure Data Lake Store and Analytics
Created 2015-04-28
861 commits to master branch, last one about a year ago
35
136
agpl-3.0
6
An efficient storage and compute engine for both on-prem and cloud-native data analytics.
Created 2019-06-21
1,399 commits to master branch, last one 2 days ago
34
124
apache-2.0
13
🤖 The semantic engine for LLMs, bringing semantic context to AI agents. 🔥
Created 2022-05-09
697 commits to main branch, last one 21 hours ago
Apache Spark 3 - Structured Streaming Course Material
Created 2020-07-21
29 commits to master branch, last one 4 years ago
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Created 2019-08-07
1,968 commits to develop-spark3 branch, last one 12 days ago
Cloudflare R2 bucket File Uploader with multipart upload enabled. Tested with files up to 10 GB in size.
Created 2023-09-13
24 commits to main branch, last one 2 months ago
Apache Spark Course Material
Created 2020-05-05
34 commits to master branch, last one 4 years ago
GraphQL API for Zeebe data
Created 2020-02-03
781 commits to main branch, last one 12 days ago
10
60
apache-2.0
27
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
This repository has been archived
Created 2018-01-29
135 commits to master branch, last one 2 months ago
Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.
Created 2023-04-04
15 commits to main branch, last one about a year ago
26
55
apache-2.0
3
Web UI for Amazon Athena
Created 2020-12-30
42 commits to master branch, last one 2 years ago
1
52
apache-2.0
11
The DBT of ML: Aligned describes data dependencies in ML systems and reduces technical data debt
Created 2022-04-27
425 commits to main branch, last one 9 days ago