36 results found
- Filter by Primary Language:
- Python (9)
- Java (5)
- Go (4)
- Scala (4)
- Jupyter Notebook (4)
- TypeScript (2)
- C# (1)
- Vue (1)
- C++ (1)
- Dockerfile (1)
- JavaScript (1)
- Kotlin (1)
- Rust (1)
lakeFS - Data version control for your data lake | Git for data
Created
2019-09-12
5,474 commits to master branch, last one 23 hours ago
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Created
2022-01-26
3,401 commits to devel branch, last one 21 hours ago
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Created
2017-12-18
4,167 commits to master branch, last one 2 days ago
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every d...
Created
2022-09-29
236 commits to master branch, last one 10 months ago
A few projects related to data engineering, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.
Created
2020-01-20
80 commits to master branch, last one 4 years ago
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Created
2020-02-13
50 commits to master branch, last one 4 years ago
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is license...
Created
2017-02-08
8,313 commits to master branch, last one 5 years ago
Personal Data Engineering Projects
Created
2020-04-20
65 commits to master branch, last one 2 years ago
Data API Framework for AI Agents and Data Apps
Created
2022-04-27
1,056 commits to develop branch, last one 7 months ago
Generic Data Ingestion & Dispersal Library for Hadoop
This repository has been archived
Created
2018-01-05
33 commits to master branch, last one 5 years ago
Enterprise-grade, production-hardened, serverless data lake on AWS
Created
2020-09-08
559 commits to main branch, last one 18 hours ago
Use SQL to build ELT pipelines on a data lakehouse.
Created
2021-03-11
481 commits to main branch, last one 2 years ago
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Created
2020-02-07
720 commits to master branch, last one 2 months ago
U-SQL Examples and Issue Tracking
Created
2015-10-13
253 commits to master branch, last one about a year ago
Lakekeeper: A Rust native Iceberg REST Catalog
Created
2024-04-05
345 commits to main branch, last one a day ago
BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)
Created
2023-05-22
7 commits to master branch, last one 6 months ago
Resources for video demonstrations and blog posts related to DataOps on AWS
Created
2021-11-07
107 commits to main branch, last one 2 years ago
An efficient storage and compute engine for both on-prem and cloud-native data analytics.
Created
2019-06-21
1,399 commits to master branch, last one 16 days ago
Samples and Docs for Azure Data Lake Store and Analytics
Created
2015-04-28
861 commits to master branch, last one about a year ago
🤖 The semantic engine for LLMs, bringing semantic context to AI agents. 🔥
Created
2022-05-09
726 commits to main branch, last one 9 hours ago
Apache Spark 3 - Structured Streaming Course Material
Created
2020-07-21
29 commits to master branch, last one 4 years ago
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Created
2019-08-07
1,968 commits to develop-spark3 branch, last one 26 days ago
Cloudflare R2 bucket file uploader with multipart upload enabled. Tested with files up to 10 GB in size.
Created
2023-09-13
24 commits to main branch, last one 3 months ago
A Git-like Version Control File System for Datasets Management in the Era of AI.
Created
2023-11-24
296 commits to main branch, last one 4 days ago
Apache Spark Course Material
Created
2020-05-05
34 commits to master branch, last one 4 years ago
GraphQL API for Zeebe data
Created
2020-02-03
781 commits to main branch, last one 27 days ago
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
This repository has been archived
Created
2018-01-29
135 commits to master branch, last one 2 months ago
Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.
Created
2023-04-04
15 commits to main branch, last one about a year ago
Web UI for Amazon Athena
Created
2020-12-30
42 commits to master branch, last one 2 years ago
The DBT of ML: Aligned describes data dependencies in ML systems and reduces technical data debt.
Created
2022-04-27
425 commits to main branch, last one 23 days ago