Trending repositories for topic spark
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
Apache Spark - A unified analytics engine for large-scale data processing
Apache Doris is an easy-to-use, high performance and unified analytics database.
🧙 Build, run, and manage data pipelines for integrating and transforming data.
Learn and understand Docker&Container technologies, with real DevOps practice!
cube studio开源云原生一站式机器学习/深度学习/大模型AI平台,支持sso登录,多租户,大数据平台对接,notebook在线开发,拖拉拽任务流pipeline编排,多机多卡分布式训练,超参搜索,推理服务VGPU,边缘计算,serverless,标注平台,自动化标注,数据集管理,大模型微调,vllm大模型推理,llmops,私有知识库,AI模型应用商店,支持模型一键开发/推理/微调,支持国...
Open source project for data preparation of LLM application builders
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
XL-LightHouse是一套支持超大数据量、支持超高并发的通用型流式大数据统计系统【同时支持单机版】。常见的应用场景包括:PV、UV统计;电商销售额、下单用户数统计;日志量统计;接口调用量、异常量、耗时情况统计;服务器运维指标监控等功能。系统支持多维度统计,支持各种复杂的条件筛选和逻辑判断,一键部署,一行代码接入,轻松实现各种海量数据实时统计,帮助企业以更低的成本快速搭建起数据指标体系,是企业...
GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.
Blazing-fast query execution engine speaks Apache Spark language and has Arrow-DataFusion at its core.
✨Spark is a web-based, cross-platform and full-featured Remote Administration Tool (RAT) written in Go that allows you control all your devices anywhere. Spark是一个Go编写的,网页UI、跨平台以及多功能的远程控制和监控工具,你可以随时随地监...
【大厂面试专栏】一份Java程序员需要的技术指南,这里有面试题、系统架构、职场锦囊、主流中间件等,让你成为更牛的自己!
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
Entity Matching Model solves the problem of matching company names between two possibly very large datasets.
🎹 Moodify - an emotion-based music recommendation system that uses AI/ML models to analyze text, speech, and facial expressions, providing personalized music recommendations across web and mobile pla...
Open source project for data preparation of LLM application builders
XL-LightHouse是一套支持超大数据量、支持超高并发的通用型流式大数据统计系统【同时支持单机版】。常见的应用场景包括:PV、UV统计;电商销售额、下单用户数统计;日志量统计;接口调用量、异常量、耗时情况统计;服务器运维指标监控等功能。系统支持多维度统计,支持各种复杂的条件筛选和逻辑判断,一键部署,一行代码接入,轻松实现各种海量数据实时统计,帮助企业以更低的成本快速搭建起数据指标体系,是企业...
Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake
Blazing-fast query execution engine speaks Apache Spark language and has Arrow-DataFusion at its core.
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POC...
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
✨Spark is a web-based, cross-platform and full-featured Remote Administration Tool (RAT) written in Go that allows you control all your devices anywhere. Spark是一个Go编写的,网页UI、跨平台以及多功能的远程控制和监控工具,你可以随时随地监...
DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
The Lakehouse Engine is a configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Prod...
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. :zap:
cube studio开源云原生一站式机器学习/深度学习/大模型AI平台,支持sso登录,多租户,大数据平台对接,notebook在线开发,拖拉拽任务流pipeline编排,多机多卡分布式训练,超参搜索,推理服务VGPU,边缘计算,serverless,标注平台,自动化标注,数据集管理,大模型微调,vllm大模型推理,llmops,私有知识库,AI模型应用商店,支持模型一键开发/推理/微调,支持国...
A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
🐍 Quick reference guide to common patterns & functions in PySpark.
大数据采集,抽取平台,zdh_web是zdh系列服务的可视化管理平台,包含数据采集,调度,权限,审批流,私域营销等模块
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Apache Spark - A unified analytics engine for large-scale data processing
Apache Doris is an easy-to-use, high performance and unified analytics database.
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
cube studio开源云原生一站式机器学习/深度学习/大模型AI平台,支持sso登录,多租户,大数据平台对接,notebook在线开发,拖拉拽任务流pipeline编排,多机多卡分布式训练,超参搜索,推理服务VGPU,边缘计算,serverless,标注平台,自动化标注,数据集管理,大模型微调,vllm大模型推理,llmops,私有知识库,AI模型应用商店,支持模型一键开发/推理/微调,支持国...
🧙 Build, run, and manage data pipelines for integrating and transforming data.
Open source project for data preparation of LLM application builders
Learn and understand Docker&Container technologies, with real DevOps practice!
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
【大厂面试专栏】一份Java程序员需要的技术指南,这里有面试题、系统架构、职场锦囊、主流中间件等,让你成为更牛的自己!
GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
XL-LightHouse是一套支持超大数据量、支持超高并发的通用型流式大数据统计系统【同时支持单机版】。常见的应用场景包括:PV、UV统计;电商销售额、下单用户数统计;日志量统计;接口调用量、异常量、耗时情况统计;服务器运维指标监控等功能。系统支持多维度统计,支持各种复杂的条件筛选和逻辑判断,一键部署,一行代码接入,轻松实现各种海量数据实时统计,帮助企业以更低的成本快速搭建起数据指标体系,是企业...
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
Open source project for data preparation of LLM application builders
🎹 Moodify - an emotion-based music recommendation system that uses AI/ML models to analyze text, speech, and facial expressions, providing personalized music recommendations across web and mobile pla...
💭 一个可二次开发 Chat Bot 对话 Web 端 MVP 原型模板, 基于 Vue3、Vite 5、TypeScript、Naive UI 、UnoCSS 等主流技术构建, 🧤简单集成大模型 API, 采用单轮 AI 问答对话模式, 每次提问独立响应, 无需上下文, 支持打字机效果流式输出, 集成 markdown-it 预览, 💼 易于定制和快速搭建 Chat 类大语言模型产品 (...
Entity Matching Model solves the problem of matching company names between two possibly very large datasets.
XL-LightHouse是一套支持超大数据量、支持超高并发的通用型流式大数据统计系统【同时支持单机版】。常见的应用场景包括:PV、UV统计;电商销售额、下单用户数统计;日志量统计;接口调用量、异常量、耗时情况统计;服务器运维指标监控等功能。系统支持多维度统计,支持各种复杂的条件筛选和逻辑判断,一键部署,一行代码接入,轻松实现各种海量数据实时统计,帮助企业以更低的成本快速搭建起数据指标体系,是企业...
Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.
Spark fires is a anti-pattern playground where we deliberately break Spark applications in various ways so you can observe what happens and potentially recognise the issue when you come across it in y...
Toolbox for building Generative AI applications on top of Apache Spark.
多平台模型接入,可扩展,多种输出格式,提供大语言模型聊天服务的插件 | A bot plugin for LLM chat services with multi-model integration, extensibility, and various output formats
Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake
🚀 讯飞星火大模型逆向API白嫖测试【特长:办公助手】,支持高速流式输出、智能体对话、联网搜索、AI绘图、长文档解读、图像解析、多轮对话,零配置部署,多路token支持,自动清理会话痕迹。
✏️[计算机基础+java基础+大数据基础及进阶+面试指南] 一份涵盖计算机基础,java,大数据,面试宝典,大部分核心知识的项目,学习,面试,共同进步!
Big data computing platform based on Spark <至轻云-超轻量级大数据计算平台/数据中台>
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
Know your data better!Datavines is Next-gen Data Observability Platform, support metadata manage and data quality.
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. :zap:
DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
Apache Spark - A unified analytics engine for large-scale data processing
Apache Doris is an easy-to-use, high performance and unified analytics database.
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
🧙 Build, run, and manage data pipelines for integrating and transforming data.
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
cube studio开源云原生一站式机器学习/深度学习/大模型AI平台,支持sso登录,多租户,大数据平台对接,notebook在线开发,拖拉拽任务流pipeline编排,多机多卡分布式训练,超参搜索,推理服务VGPU,边缘计算,serverless,标注平台,自动化标注,数据集管理,大模型微调,vllm大模型推理,llmops,私有知识库,AI模型应用商店,支持模型一键开发/推理/微调,支持国...
Blazing-fast query execution engine speaks Apache Spark language and has Arrow-DataFusion at its core.
Learn and understand Docker&Container technologies, with real DevOps practice!
✨Spark is a web-based, cross-platform and full-featured Remote Administration Tool (RAT) written in Go that allows you control all your devices anywhere. Spark是一个Go编写的,网页UI、跨平台以及多功能的远程控制和监控工具,你可以随时随地监...
Open source project for data preparation of LLM application builders
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
💭 一个可二次开发 Chat Bot 对话 Web 端 MVP 原型模板, 基于 Vue3、Vite 5、TypeScript、Naive UI 、UnoCSS 等主流技术构建, 🧤简单集成大模型 API, 采用单轮 AI 问答对话模式, 每次提问独立响应, 无需上下文, 支持打字机效果流式输出, 集成 markdown-it 预览, 💼 易于定制和快速搭建 Chat 类大语言模型产品 (...
Open source project for data preparation of LLM application builders
Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake
🎹 Moodify - an emotion-based music recommendation system that uses AI/ML models to analyze text, speech, and facial expressions, providing personalized music recommendations across web and mobile pla...
Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.
This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.
Spark fires is a anti-pattern playground where we deliberately break Spark applications in various ways so you can observe what happens and potentially recognise the issue when you come across it in y...
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
多平台模型接入,可扩展,多种输出格式,提供大语言模型聊天服务的插件 | A bot plugin for LLM chat services with multi-model integration, extensibility, and various output formats
XL-LightHouse是一套支持超大数据量、支持超高并发的通用型流式大数据统计系统【同时支持单机版】。常见的应用场景包括:PV、UV统计;电商销售额、下单用户数统计;日志量统计;接口调用量、异常量、耗时情况统计;服务器运维指标监控等功能。系统支持多维度统计,支持各种复杂的条件筛选和逻辑判断,一键部署,一行代码接入,轻松实现各种海量数据实时统计,帮助企业以更低的成本快速搭建起数据指标体系,是企业...
Big data computing platform based on Spark <至轻云-超轻量级大数据计算平台/数据中台>
Entity Matching Model solves the problem of matching company names between two possibly very large datasets.
Toolbox for building Generative AI applications on top of Apache Spark.
Know your data better!Datavines is Next-gen Data Observability Platform, support metadata manage and data quality.
SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.
Blazing-fast query execution engine speaks Apache Spark language and has Arrow-DataFusion at its core.
Full stack data engineering tools and infrastructure set-up
This project demonstrates how to use Apache Airflow to submit jobs to Apache spark cluster in different programming laguages using Python, Scala and Java as an example.
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
Open source project for data preparation of LLM application builders
🚀 讯飞星火大模型逆向API白嫖测试【特长:办公助手】,支持高速流式输出、智能体对话、联网搜索、AI绘图、长文档解读、图像解析、多轮对话,零配置部署,多路token支持,自动清理会话痕迹。
Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake
IT Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public
End to end data engineering project with kafka, airflow, spark, postgres and docker.
Full-stack Highly Scalable Cloud-native Machine Learning system for demand forecasting with realtime data streaming, inference, retraining loop, and more
💭 一个可二次开发 Chat Bot 对话 Web 端 MVP 原型模板, 基于 Vue3、Vite 5、TypeScript、Naive UI 、UnoCSS 等主流技术构建, 🧤简单集成大模型 API, 采用单轮 AI 问答对话模式, 每次提问独立响应, 无需上下文, 支持打字机效果流式输出, 集成 markdown-it 预览, 💼 易于定制和快速搭建 Chat 类大语言模型产品 (...
Spark fires is a anti-pattern playground where we deliberately break Spark applications in various ways so you can observe what happens and potentially recognise the issue when you come across it in y...
A Python package to submit and manage Apache Spark applications on Kubernetes.
SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.
🎹 Moodify - an emotion-based music recommendation system that uses AI/ML models to analyze text, speech, and facial expressions, providing personalized music recommendations across web and mobile pla...
Apache Spark - A unified analytics engine for large-scale data processing
Apache Doris is an easy-to-use, high performance and unified analytics database.
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
cube studio开源云原生一站式机器学习/深度学习/大模型AI平台,支持sso登录,多租户,大数据平台对接,notebook在线开发,拖拉拽任务流pipeline编排,多机多卡分布式训练,超参搜索,推理服务VGPU,边缘计算,serverless,标注平台,自动化标注,数据集管理,大模型微调,vllm大模型推理,llmops,私有知识库,AI模型应用商店,支持模型一键开发/推理/微调,支持国...
🧙 Build, run, and manage data pipelines for integrating and transforming data.
GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.
Learn and understand Docker&Container technologies, with real DevOps practice!
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
flink learning blog. http://www.54tianzhisheng.cn/ 含 Flink 入门、概念、原理、实战、性能调优、源码解析等内容。涉及 Flink Connector、Metrics、Library、DataStream API、Table API & SQL 等内容的学习案例,还有 Flink 落地应用的大型项目案例(PVUV、日志存储、百亿数据实时去重、...
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Blazing-fast query execution engine speaks Apache Spark language and has Arrow-DataFusion at its core.
✨Spark is a web-based, cross-platform and full-featured Remote Administration Tool (RAT) written in Go that allows you control all your devices anywhere. Spark是一个Go编写的,网页UI、跨平台以及多功能的远程控制和监控工具,你可以随时随地监...
XL-LightHouse是一套支持超大数据量、支持超高并发的通用型流式大数据统计系统【同时支持单机版】。常见的应用场景包括:PV、UV统计;电商销售额、下单用户数统计;日志量统计;接口调用量、异常量、耗时情况统计;服务器运维指标监控等功能。系统支持多维度统计,支持各种复杂的条件筛选和逻辑判断,一键部署,一行代码接入,轻松实现各种海量数据实时统计,帮助企业以更低的成本快速搭建起数据指标体系,是企业...
This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.
A Python package to submit and manage Apache Spark applications on Kubernetes.
The Lakehouse Engine is a configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Prod...
Full-stack Highly Scalable Cloud-native Machine Learning system for demand forecasting with realtime data streaming, inference, retraining loop, and more
🎨 UI for the Free Data Engineering Zoomcamp Course provided by DataTalksClub
Create a streaming data, transfer it to Kafka, modify it with PySpark, take it to ElasticSearch and MinIO
智元 IIM 是一款开源的网页版即时聊天系统, 同时拥有AI聊天对话功能, 支持ChatGPT、Midjourney、文心一言、讯飞星火、通义千问等AI助手功能
A simple VS Code devcontainer setup for local PySpark development
多平台模型接入,可扩展,多种输出格式,提供大语言模型聊天服务的插件 | A bot plugin for LLM chat services with multi-model integration, extensibility, and various output formats
Data Engineering examples for Airflow, Prefect, and Mage.ai; dbt for BigQuery, Redshift, ClickHouse, PostgreSQL; Spark/PySpark for Batch processing; and Kafka for Stream processing
Playground for Lakehouse (Iceberg, Hudi, Spark, Flink, Trino, DBT, Airflow, Kafka, Debezium CDC)
Sample project to demonstrate data engineering best practices
智元 IIM 是一款开源的网页版即时聊天系统, 同时拥有AI聊天对话功能, 支持ChatGPT、Midjourney、文心一言、讯飞星火、通义千问等AI助手功能
Notes on Data Engineering with Pandas, PySpark, Dask, Ray, Arrow DataFusion, Polars etc.
This project implements an ELT (Extract - Load - Transform) data pipeline with the goodreads dataset, using dagster (orchestration), spark (calculation) and dbt (transformation)
The goal of this project is to build a docker cluster that gives access to Hadoop, HDFS, Hive, PySpark, Sqoop, Airflow, Kafka, Flume, Postgres, Cassandra, Hue, Zeppelin, Kadmin, Kafka Control Center ...