36 results found
lakeFS - Data version control for your data lake | Git for data
Created
2019-09-12
5,459 commits to master branch, last one 18 hours ago
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Created
2022-01-26
3,374 commits to devel branch, last one 12 hours ago
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Created
2017-12-18
4,156 commits to master branch, last one 2 days ago
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every d...
Created
2022-09-29
236 commits to master branch, last one 10 months ago
A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
Created
2020-01-20
80 commits to master branch, last one 4 years ago
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Created
2020-02-13
50 commits to master branch, last one 4 years ago
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is license...
Created
2017-02-08
8,313 commits to master branch, last one 5 years ago
Personal Data Engineering Projects
Created
2020-04-20
65 commits to master branch, last one 2 years ago
Data API Framework for AI Agents and Data Apps
Created
2022-04-27
1,056 commits to develop branch, last one 6 months ago
Generic Data Ingestion & Dispersal Library for Hadoop
This repository has been archived
Created
2018-01-05
33 commits to master branch, last one 5 years ago
Enterprise-grade, production-hardened, serverless data lake on AWS
Created
2020-09-08
511 commits to main branch, last one 8 days ago
Use SQL to build ELT pipelines on a data lakehouse.
Created
2021-03-11
481 commits to main branch, last one 2 years ago
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Created
2020-02-07
720 commits to master branch, last one about a month ago
U-SQL Examples and Issue Tracking
Created
2015-10-13
253 commits to master branch, last one about a year ago
BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)
Created
2023-05-22
7 commits to master branch, last one 6 months ago
Lakekeeper: A Rust native Iceberg REST Catalog
Created
2024-04-05
307 commits to main branch, last one 10 hours ago
Resources for video demonstrations and blog posts related to DataOps on AWS
Created
2021-11-07
107 commits to main branch, last one 2 years ago
Samples and Docs for Azure Data Lake Store and Analytics
Created
2015-04-28
861 commits to master branch, last one about a year ago
An efficient storage and compute engine for both on-prem and cloud-native data analytics.
Created
2019-06-21
1,399 commits to master branch, last one 2 days ago
🤖 The semantic engine for LLMs, bringing semantic context to AI agents. 🔥
Created
2022-05-09
697 commits to main branch, last one 21 hours ago
Apache Spark 3 - Structured Streaming Course Material
Created
2020-07-21
29 commits to master branch, last one 4 years ago
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Created
2019-08-07
1,968 commits to develop-spark3 branch, last one 12 days ago
Cloudflare R2 bucket File Uploader with multipart upload enabled. Tested with files up to 10 GB in size.
Created
2023-09-13
24 commits to main branch, last one 2 months ago
Apache Spark Course Material
Created
2020-05-05
34 commits to master branch, last one 4 years ago
A Git-like Version Control File System for Datasets Management in the Era of AI.
Created
2023-11-24
292 commits to main branch, last one 6 days ago
GraphQL API for Zeebe data
Created
2020-02-03
781 commits to main branch, last one 12 days ago
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
This repository has been archived
Created
2018-01-29
135 commits to master branch, last one 2 months ago
Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.
Created
2023-04-04
15 commits to main branch, last one about a year ago
Web UI for Amazon Athena
Created
2020-12-30
42 commits to master branch, last one 2 years ago
The DBT of ML: Aligned describes data dependencies in ML systems and reduces technical data debt
Created
2022-04-27
425 commits to main branch, last one 9 days ago