Search Results - RepositoryStats

370

4.6k

apache-2.0

42

lakeFS - Data version control for your data lake | Git for data

go aws-s3 golang lakefs datalake data-lake datalakes apache-spark data-quality git-for-data azure-storage object-storage apache-sparksql data-versioning data-engineering hadoop-filesystem azure-blob-storage data-version-control google-cloud-storage

Created 2019-09-12

5,727 commits to master branch, last one 2 days ago

dlt dlt-hub

243

3.4k

apache-2.0

22

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️

elt data load python extract data-lake transform data-loading data-warehouse data-engineering

Created 2022-01-26

3,549 commits to devel branch, last one 4 days ago

kyuubi apache

936

2.2k

apache-2.0

63

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

sql hive jdbc spark hadoop thrift data-lake spark-sql kubernetes hacktoberfest

Created 2017-12-18

4,265 commits to master branch, last one a day ago

bitsail bytedance

331

1.7k

apache-2.0

62

BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every d...

flink big-data data-lake real-time data-pipeline data-integration high-performance data-synchronization

Created 2022-09-29

236 commits to master branch, last one about a year ago

Udacity-Data-Engineering-Projects san089

507

1.6k

other

39

A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.

Created 2020-01-20

80 commits to master branch, last one 5 years ago

goodreads_etl_pipeline san089

223

1.4k

mit

25

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

Created 2020-02-13

50 commits to master branch, last one 5 years ago

kylo Teradata

577

1.1k

apache-2.0

112

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is license...

kylo nifi spark hadoop teradata data-lake

Created 2017-02-08

8,313 commits to master branch, last one 6 years ago

Data-Engineering-Projects alanchn31

203

919

unknown

8

Personal Data Engineering Projects

spark scrapy airflow mongodb postgres cassandra data-lake ingest-data star-schema aws-redshift data-modeling data-warehouse data-engineering data-engineering-nanodegree

Created 2020-04-20

65 commits to master branch, last one 2 years ago

vulcan-sql Canner

32

671

apache-2.0

15

Data API Framework for AI Agents and Data Apps

Created 2022-04-27

1,056 commits to develop branch, last one 11 months ago

lakekeeper lakekeeper

35

523

apache-2.0

6

Lakekeeper is an Apache-licensed, secure, fast and easy-to-use Apache Iceberg REST Catalog written in Rust.

rust catalog iceberg data-lake lakehouse open-lakehouse lakehouse-governance

Created 2024-04-05

672 commits to main branch, last one 23 hours ago

marmaray uber

111

478

other

27

Generic Data Ingestion & Dispersal Library for Hadoop

spark hadoop data-lake avro-schema ingest-data schema-format

This repository has been archived

Created 2018-01-05

33 commits to master branch, last one 5 years ago

data-lakes-on-aws aws-solutions-library-samples

141

445

mit-0

33

Enterprise-grade, production-hardened, serverless data lake on AWS

aws etl iac analytics data-lake framework serverless best-practices lake-formation data-engineering

Created 2020-09-08

647 commits to main branch, last one 2 days ago

cuelake cuebook

28

285

apache-2.0

11

Use SQL to build ELT pipelines on a data lakehouse.

elt etl sql delta upsert datalake data-lake lakehouse pipelines spark-sql apache-spark data-pipeline data-transfer apache-iceberg data-ingestion data-engineering data-integration zeppelin-notebook incremental-updates

Created 2021-03-11

481 commits to main branch, last one 2 years ago

amazon-s3-find-and-forget awslabs

36

242

apache-2.0

14

Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

s3 aws ccpa data gdpr parquet privacy big-data amazon-s3 data-lake data-erasure right-to-be-forgotten

Created 2020-02-07

723 commits to master branch, last one 25 days ago

btrblocks maxi-k

20

238

mit

7

BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)

research data-lake databases compression

Created 2023-05-22

7 commits to master branch, last one 10 months ago

usql Azure

679

234

mit

110

U-SQL Examples and Issue Tracking

azure u-sql big-data data-lake

Created 2015-10-13

253 commits to master branch, last one 2 years ago

wren-engine Canner

60

226

apache-2.0

14

🤖 The Semantic Engine for Model Context Protocol (MCP) Clients and AI Agents 🔥

ai llm mcp sql data agent semantic data-lake agentic-ai mcp-server data-analysis hacktoberfest data-analytics data-warehouse semantic-layer business-intelligence

Created 2022-05-09

898 commits to main branch, last one 3 days ago

tickit-data-lake-demo garystafford

108

172

unknown

4

Resources for video demonstrations and blog posts related to DataOps on AWS

aws devops airflow dataops redshift data-lake

Created 2021-11-07

107 commits to main branch, last one 3 years ago

pixels pixelsdb

45

143

agpl-3.0

4

An efficient storage and compute engine for both on-prem and cloud-native data analytics.

olap database data-lake column-store cloud-database data-warehouse

Created 2019-06-21

1,439 commits to master branch, last one 13 days ago

AzureDataLake Azure

105

139

mit

107

Samples and Docs for Azure Data Lake Store and Analytics

azure big-data data-lake

Created 2015-04-28

861 commits to master branch, last one 2 years ago

Spark-Streaming-In-Python LearningJournal

159

121

mit

7

Apache Spark 3 - Structured Streaming Course Material

python bigdata pyspark big-data data-lake spark-sql apache-spark spark-streaming

Created 2020-07-21

29 commits to master branch, last one 4 years ago

smart-data-lake smart-data-lake

22

120

gpl-3.0

14

Smart Automation Tool for building modern Data Lakes and Data Pipelines

hive scala spark hadoop data-lake deltalake data-pipelines transform-data smart-data-lake

Created 2019-08-07

2,046 commits to develop-spark3 branch, last one 5 days ago

r2-bucket-uploader datopian

11

115

mit

7

Cloudflare R2 bucket file uploader with multipart upload enabled. Tested with files up to 10 GB in size. Demo example for NextJS.

r2 s3 blob bucket data-lake cloudflare blob-storage object-storage

Created 2023-09-13

24 commits to main branch, last one 7 months ago

jzfs GitDataAI

10

103

other

1

A Git-like Version Control File System for AI & Data Product Management.

git jzfs aiops mlops dataops jiaozifs data-lake data-lineage data-product git-for-data digital-twins git-interface git-filesystem data-versioning data-collaboration federated-learning data-version-control version-controlled-filesystem

Created 2023-11-24

298 commits to main branch, last one 3 months ago

SparkProgrammingInScala LearningJournal

159

88

mit

9

Apache Spark Course Material

scala spark bigdata big-data datalake data-lake spark-sql spark-scala apache-spark

Created 2020-05-05

34 commits to master branch, last one 4 years ago

Local-Data-LakeHouse dominikhei

13

63

unknown

4

Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.

minio trino data-lake lakehouse apache-iceberg data-lakehouse hive-metastore

Created 2023-04-04

15 commits to main branch, last one about a year ago

zeeqs camunda-community-hub

15

62

apache-2.0

7

GraphQL API for Zeebe data

zeebe graphql data-lake zeebe-tool

Created 2020-02-03

781 commits to main branch, last one 5 months ago

lighthouse datamindedbe

10

61

apache-2.0

29

Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.

data-lake

This repository has been archived

Created 2018-01-29

135 commits to master branch, last one 6 months ago

aligned MatsMoll

2

56

other

10

The DBT of ML: Aligned describes data dependencies in ML systems and reduces technical data debt

ai ml dbt mlops ml-ops data-lake datacontracts feature-store data-contracts feature-engineering

Created 2022-04-27

443 commits to main branch, last one 6 days ago

querypal OElesin

28

56

apache-2.0

2

Web UI for Amazon Athena

aws sql data analytics data-lake aws-athena

Created 2020-12-30

42 commits to master branch, last one 2 years ago