Trending repositories for topic spark
Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.
An open source, standard data file format for graph data storage and retrieval.
Apache Spark - A unified analytics engine for large-scale data processing
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math c...
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
Open source project for data preparation of LLM application builders
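💭 A customizable MVP prototype template for a single-turn Chat Bot web app, built with mainstream technologies including Vue 3, Vite 6, TypeScript, Naive UI, Pinia (v3), and UnoCSS. 🧤 Simple integration with large-model APIs; uses a single-turn AI Q&A mode in which each question gets an independent response with no context required; supports typewriter-style streaming output; integrates markdown-it Mermaid/KaTex/L...
cube studio is an open-source, cloud-native, one-stop machine learning / deep learning / large-model AI platform. It supports SSO login, big data platform integration, online notebook development, drag-and-drop task-flow pipeline orchestration, multi-node multi-GPU distributed training, hyperparameter search, VGPU inference serving, edge computing, an annotation platform, automated annotation, large-model fine-tuning, vLLM large-model inference, LLMOps, private knowledge bases, an AI model app store, one-click model development/inference/fine-tuning, Chinese domestic CPU/GPU/NPU chips, RDMA...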
🧙 Build, run, and manage data pipelines for integrating and transforming data.
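[Big Tech Interview Column] A technical guide for Java programmers, covering interview questions, system architecture, career tips, mainstream middleware, and more, to help you become a better version of yourself!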
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Learn and understand Docker&Container technologies, with real DevOps practice!
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
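:herb: Quick-start learning examples based on springboot, integrating open-source frameworks the author has encountered, such as rabbitmq (delay queues), Kafka, jpa, redis, oauth2, swagger, jsp, docker, k3s, k3d, k8s, a mybatis encryption/decryption plugin, exception handling, log output, multi-module development, multi-environment packaging, caching, web crawlers, jwt, GraphQL, dubbo, zookeeper, Async, and more :pushpin...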
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
LakeSail's computation framework with a mission to unify batch processing, stream processing, and compute-intensive (AI) workloads.
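A knowledge system for data infrastructure and big data technology, covering mainstream frameworks such as hadoop, hive, spark, and flink and their related frameworks, as well as data middle platforms, data lakes, data governance, data warehouse construction, and data-driven transformation.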
Uniffle is a high performance, general purpose Remote Shuffle Service.
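A next-generation real-time computing foundation with computing performance exceeding flink/spark by 100x. XL-LightHouse is a general-purpose streaming big data statistics system that supports very large data volumes and very high concurrency [a standalone version is also supported]. Common use cases include PV and UV statistics; e-commerce sales and ordering-user counts; log volume statistics; statistics on API call volume, error counts, and latency; and server operations monitoring. The system supports multi-dimensional statistics and a variety of complex filter conditions and logical checks, with one-click deployment and one-line-of-code integration, making it easy to implement end-to-end business...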
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. :zap:
A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!
A curated list of awesome resources related to the Ada and SPARK programming languages
DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
Apache Doris is an easy-to-use, high performance and unified analytics database.
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
✨Spark is a web-based, cross-platform and full-featured Remote Administration Tool (RAT) written in Go that allows you to control all your devices anywhere. Spark is a Go-based remote control and monitoring tool with a web UI, cross-platform support, and a full feature set; anytime, anywhere, you can...
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Me...
🎹 Moodify - an emotion-based music recommendation system that uses AI/ML models to analyze text, speech, and facial expressions, providing personalized music recommendations across web and mobile pla...
Big data computing platform based on Spark <Zhiqingyun (至轻云): an ultra-lightweight big data computing platform / data middle platform>
Sample project to demonstrate data engineering best practices
More than 2,000 data engineer interview questions.
Implementing best practices for PySpark ETL jobs and applications.
This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.
Formally verified, real-time capable, UNIX-like operating system kernel written in SPARK and Ada.
Detailed notes and homework from the 2025 Data Engineering Zoomcamp by DataTalks.Club
End-to-end data platform: A PoC Data Platform project utilizing modern data stack (Spark, Airflow, DBT, Trino, Lightdash, Hive metastore, Minio, Postgres)
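An easy-to-use, ultra-high-performance big data governance engine built on Spark + SparkMLlib + Debezium, suited to unified batch and streaming data integration and data analysis; supports machine learning models, CDC real-time data capture, data quality validation, data modeling, algorithm modeling, and OLAP data analysis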
Projects done in the Data Engineer Nanodegree Program by Udacity.com
Spark Accelerator framework; it enables secondary indices to remote data stores.
End to end data engineering project with kafka, airflow, spark, postgres and docker.
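An enterprise-grade system for rapid LLM API integration, supporting OpenAI, Azure, ERNIE Bot (文心一言), iFLYTEK Spark (讯飞星火), Tongyi Qianwen (通义千问), Zhipu GLM (智谱GLM), Gemini, DeepSeek, Anthropic Claude, and OpenAI-format models, with a clean page style; lightweight, efficient, and stable, with one-click Docker deployment.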
Open Source LeetCode for PySpark, Spark, Pandas and DBT/Snowflake
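🚀 A reverse-engineered API for the iFLYTEK Spark (讯飞星火) large model [specialty: office assistant], supporting high-speed streaming output, agent conversations, web search, AI image generation, long-document interpretation, image analysis, and multi-turn conversation, with zero-configuration deployment, multi-token support, and automatic cleanup of session traces. For testing only; for commercial use, please go to the official open platform.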
SparkConnect Server plugin and protobuf messages for the Amazon Deequ Data Quality Engine.
Calculating word counts for corresponding English and Persian text to build a dictionary with Spark
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.
Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Large Tech Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public
This project demonstrates how to use Apache Airflow to submit jobs to an Apache Spark cluster in different programming languages, using Python, Scala, and Java as examples.
Full-stack Highly Scalable Cloud-native Machine Learning system for demand forecasting with realtime data streaming, inference, retraining loop, and more
Entity Matching Model solves the problem of matching company names between two possibly very large datasets.
A simple VS Code devcontainer setup for local PySpark development
Declarative text based tool for data analysts and engineers to extract, load, transform and orchestrate their data pipelines.
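✏️ [Computer fundamentals + java fundamentals + big data basics and advanced topics + interview guide] A project covering most of the core knowledge of computer fundamentals, java, big data, and interview preparation. Learn, interview, and improve together!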
Playground for Lakehouse (Iceberg, Hudi, Spark, Flink, Trino, DBT, Airflow, Kafka, Debezium CDC)
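Zhiyuan IIM (智元 IIM) is an open-source web-based instant messaging system that also offers AI chat, supporting AI assistants such as ChatGPT, Midjourney, ERNIE Bot (文心一言), iFLYTEK Spark (讯飞星火), and Tongyi Qianwen (通义千问)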