Trending repositories for topic hadoop
1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LD...
Apache Doris is an easy-to-use, high performance and unified analytics database.
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Data Science E-books, Interview Resources and Cheat-sheets
🏆 零代码、全功能、强安全 ORM 库 🚀 后端接口和文档零代码,前端(客户端) 定制返回 JSON 的数据和结构。 🏆 A JSON Transmission Protocol and an ORM Library 🚀 provides APIs and Docs without writing any code.
The official home of the Presto distributed SQL query engine for big data
More than 2000+ Data engineer interview questions.
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math c...
MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
IT Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public
本系统是我的毕业设计项目,题目为“基于用户画像的电影推荐系统的设计与实现”。主要是以Django作为基础框架,采用MTV模式,数据库使用MongoDB、MySQL和Redis,以从豆瓣平台爬取的电影数据作为基础数据源,主要基于用户的基本信息和使用操作记录等行为信息来开发用户标签,并使用Hadoop、Spark大数据组件进行分析和处理的推荐系统。管理系统使用的是Django自带的管理系统,并使用si...
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML...
IT Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public
1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LD...
Data Science E-books, Interview Resources and Cheat-sheets
More than 2000+ Data engineer interview questions.
本系统是我的毕业设计项目,题目为“基于用户画像的电影推荐系统的设计与实现”。主要是以Django作为基础框架,采用MTV模式,数据库使用MongoDB、MySQL和Redis,以从豆瓣平台爬取的电影数据作为基础数据源,主要基于用户的基本信息和使用操作记录等行为信息来开发用户标签,并使用Hadoop、Spark大数据组件进行分析和处理的推荐系统。管理系统使用的是Django自带的管理系统,并使用si...
Apache Doris is an easy-to-use, high performance and unified analytics database.
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML...
MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Scalable, redundant, and distributed object store for Apache Hadoop
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
DataSphereStudio is a one stop data application development& management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualizati...
1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LD...
Apache Doris is an easy-to-use, high performance and unified analytics database.
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Data Science E-books, Interview Resources and Cheat-sheets
The official home of the Presto distributed SQL query engine for big data
🏆 零代码、全功能、强安全 ORM 库 🚀 后端接口和文档零代码,前端(客户端) 定制返回 JSON 的数据和结构。 🏆 A JSON Transmission Protocol and an ORM Library 🚀 provides APIs and Docs without writing any code.
More than 2000+ Data engineer interview questions.
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math c...
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Me...
本系统是我的毕业设计项目,题目为“基于用户画像的电影推荐系统的设计与实现”。主要是以Django作为基础框架,采用MTV模式,数据库使用MongoDB、MySQL和Redis,以从豆瓣平台爬取的电影数据作为基础数据源,主要基于用户的基本信息和使用操作记录等行为信息来开发用户标签,并使用Hadoop、Spark大数据组件进行分析和处理的推荐系统。管理系统使用的是Django自带的管理系统,并使用si...
MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
IT Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public
Data Science E-books, Interview Resources and Cheat-sheets
1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LD...
Companion to Learning Hadoop and Learning Spark courses on Linked In Learning
More than 2000+ Data engineer interview questions.
本系统是我的毕业设计项目,题目为“基于用户画像的电影推荐系统的设计与实现”。主要是以Django作为基础框架,采用MTV模式,数据库使用MongoDB、MySQL和Redis,以从豆瓣平台爬取的电影数据作为基础数据源,主要基于用户的基本信息和使用操作记录等行为信息来开发用户标签,并使用Hadoop、Spark大数据组件进行分析和处理的推荐系统。管理系统使用的是Django自带的管理系统,并使用si...
This repository contains my Data Analytics portfolio projects ranging from SQL, Python, Tableau, Excel, and Hadoop (HiveQL).
Scalable, redundant, and distributed object store for Apache Hadoop
Apache Doris is an easy-to-use, high performance and unified analytics database.
数据建设与大数据技术知识体系,包含hadoop、hive、spark、flink主流框架和系列框架,数据中台、数据湖、数据治理、数仓建设、数据化转型等
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, So...
MooseFS – Open Source, Petabyte, Fault-Tolerant, Highly Performing, Scalable Network Distributed File System (Software-Defined Storage)
Taier is a big data development platform for submission, scheduling, operation and maintenance, and indicator information display
Set of Kubernetes solutions for reusing idle resources of nodes by running extra batch jobs
Apache Doris is an easy-to-use, high performance and unified analytics database.
🏆 零代码、全功能、强安全 ORM 库 🚀 后端接口和文档零代码,前端(客户端) 定制返回 JSON 的数据和结构。 🏆 A JSON Transmission Protocol and an ORM Library 🚀 provides APIs and Docs without writing any code.
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
More than 2000+ Data engineer interview questions.
1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LD...
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
The official home of the Presto distributed SQL query engine for big data
Data Science E-books, Interview Resources and Cheat-sheets
Alluxio, data orchestration for analytics and machine learning in the cloud
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math c...
IT Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public
Data Science E-books, Interview Resources and Cheat-sheets
More than 2000+ Data engineer interview questions.
This repository contains my Data Analytics portfolio projects ranging from SQL, Python, Tableau, Excel, and Hadoop (HiveQL).
上百本大数据电子书,附带下载链接,包括计算机基础,Java,hadoop,spark,flink,kafka,hbase,hive,数仓等
最好的大数据项目。《Titan数据运营系统》,本项目是一个全栈闭环系统,我们有用作数据可视化的web系统,然后用flume-kafaka-flume进行日志的读取,在hive设计数仓,编写spark代码进行数仓表之间的转化以及ads层表到mysql的迁移,使用azkaban进行定时任务的调度,使用技术:Java/Scala语言,Hadoop、Spark、Hive、Kafka、Flume、Azkab...
Data cleaning, pre-processing, and Analytics on a million movies using Spark and Scala.
This repository will help you to learn about databricks concept with the help of examples. It will include all the important topics which we need in our real life experience as a data engineer. We wil...
本系统是我的毕业设计项目,题目为“基于用户画像的电影推荐系统的设计与实现”。主要是以Django作为基础框架,采用MTV模式,数据库使用MongoDB、MySQL和Redis,以从豆瓣平台爬取的电影数据作为基础数据源,主要基于用户的基本信息和使用操作记录等行为信息来开发用户标签,并使用Hadoop、Spark大数据组件进行分析和处理的推荐系统。管理系统使用的是Django自带的管理系统,并使用si...
1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LD...
Companion to Learning Hadoop and Learning Spark courses on Linked In Learning
数据建设与大数据技术知识体系,包含hadoop、hive、spark、flink主流框架和系列框架,数据中台、数据湖、数据治理、数仓建设、数据化转型等
Scalable, redundant, and distributed object store for Apache Hadoop
Smart Automation Tool for building modern Data Lakes and Data Pipelines
IT Knowledge Base from 20 years in DevOps, Linux, Cloud, Big Data, AWS, GCP etc - gradually porting my large private knowledge base to public
Apache Doris is an easy-to-use, high performance and unified analytics database.
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
At LinkedIn, we are using this curriculum for onboarding our entry-level talents into the SRE role.
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AW...
🏆 零代码、全功能、强安全 ORM 库 🚀 后端接口和文档零代码,前端(客户端) 定制返回 JSON 的数据和结构。 🏆 A JSON Transmission Protocol and an ORM Library 🚀 provides APIs and Docs without writing any code.
1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LD...
The official home of the Presto distributed SQL query engine for big data
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math c...
More than 2000+ Data engineer interview questions.
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Me...
Alluxio, data orchestration for analytics and machine learning in the cloud
Data Science E-books, Interview Resources and Cheat-sheets
Create a streaming data, transfer it to Kafka, modify it with PySpark, take it to ElasticSearch and MinIO
The goal of this project is to build a docker cluster that gives access to Hadoop, HDFS, Hive, PySpark, Sqoop, Airflow, Kafka, Flume, Postgres, Cassandra, Hue, Zeppelin, Kadmin, Kafka Control Center ...
This repository contains my Data Analytics portfolio projects ranging from SQL, Python, Tableau, Excel, and Hadoop (HiveQL).
数据建设与大数据技术知识体系,包含hadoop、hive、spark、flink主流框架和系列框架,数据中台、数据湖、数据治理、数仓建设、数据化转型等
Data Science E-books, Interview Resources and Cheat-sheets
Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pipelines.
More than 2000+ Data engineer interview questions.
1000+ DevOps Bash Scripts - AWS, GCP, Kubernetes, Docker, CI/CD, APIs, SQL, PostgreSQL, MySQL, Hive, Impala, Kafka, Hadoop, Jenkins, GitHub, GitLab, BitBucket, Azure DevOps, TeamCity, Spotify, MP3, LD...
CloudEon uses Kubernetes to install and deploy open-source big data components, enabling the containerized operation of an open-source big data platform. This allows you to reduce your focus on underl...
本系统是我的毕业设计项目,题目为“基于用户画像的电影推荐系统的设计与实现”。主要是以Django作为基础框架,采用MTV模式,数据库使用MongoDB、MySQL和Redis,以从豆瓣平台爬取的电影数据作为基础数据源,主要基于用户的基本信息和使用操作记录等行为信息来开发用户标签,并使用Hadoop、Spark大数据组件进行分析和处理的推荐系统。管理系统使用的是Django自带的管理系统,并使用si...
上百本大数据电子书,附带下载链接,包括计算机基础,Java,hadoop,spark,flink,kafka,hbase,hive,数仓等
Apache Wayang(incubating) is the first cross-platform data processing system.