Search Results - RepositoryStats

6.1k

38.4k

mit

1.2k

Learn how to design, develop, deploy and iterate on production-grade ML applications.

ray llms mlops python pytorch data-quality data-science deep-learning distributed-ml data-engineering machine-learning distributed-training natural-language-processing

Created 2018-11-05

18 commits to main branch, last one about a year ago

pytorch-image-models huggingface

4.9k

33.9k

apache-2.0

316

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT)...

Created 2019-02-02

2,711 commits to main branch, last one 4 days ago

Paddle PaddlePaddle

5.7k

22.7k

apache-2.0

715

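PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 飞桨 (PaddlePaddle): high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)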

python efficiency scalability paddlepaddle deep-learning neural-network machine-learning distributed-training

Created 2016-08-15

53,687 commits to develop branch, last one 10 hours ago

PaddleNLP PaddlePaddle

3.0k

12.5k

apache-2.0

101

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

llm nlp uie bert ernie llama embedding paddlenlp compression transformers neural-search search-engine pretrained-models semantic-analysis question-answering sentiment-analysis distributed-training document-intelligence information-extraction

Created 2021-02-05

5,815 commits to develop branch, last one a day ago

skypilot skypilot-org

621

7.7k

apache-2.0

70

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

Created 2021-08-11

2,633 commits to master branch, last one 14 hours ago

Fengshenbang-LM IDEA-CCNL

380

4.1k

apache-2.0

58

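Fengshenbang-LM (封神榜大模型) is an open-source large-model system led by the Center for Cognitive Computing and Natural Language at the IDEA Research Institute, serving as infrastructure for Chinese AIGC and cognitive intelligence.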

aigc pytorch multimodal chinese-nlp transformers pretrained-models distributed-training

Created 2021-10-28

711 commits to main branch, last one about a year ago

FedML FedML-AI

745

3.8k

apache-2.0

94

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on a...

mlops edge-ai ai-agent deep-learning model-serving inference-engine machine-learning model-deployment federated-learning on-device-training distributed-training

Created 2020-07-21

12,120 commits to master branch, last one 11 months ago

byteps bytedance

491

3.7k

other

82

A high performance and generic framework for distributed DNN training

keras mxnet pytorch tensorflow deep-learning machine-learning distributed-training

Created 2019-06-25

432 commits to master branch, last one 3 years ago

adanet tensorflow

529

3.5k

apache-2.0

171

Fast and flexible AutoML with learning guarantees.

gpu tpu automl python ensemble tensorflow deep-learning learning-theory machine-learning distributed-training neural-architecture-search

Created 2018-06-28

440 commits to master branch, last one 3 years ago

determined determined-ai

364

3.1k

apache-2.0

82

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

keras mlops pytorch kubernetes tensorflow ml-platform data-science deep-learning machine-learning ml-infrastructure distributed-training hyperparameter-search hyperparameter-tuning hyperparameter-optimization

Created 2020-04-07

8,394 commits to main branch, last one about a month ago

alpa alpa-projects

360

3.1k

apache-2.0

46

Training and serving large-scale neural networks with auto parallelization.

jax llm alpa compiler deep-learning machine-learning auto-parallelization distributed-training distributed-computing high-performance-computing

This repository has been archived

Created 2021-02-22

668 commits to main branch, last one about a year ago

hivemind learning-at-home

184

2.2k

mit

55

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

dht asyncio pytorch hivemind deep-learning neural-networks machine-learning mixture-of-experts distributed-systems volunteer-computing distributed-training asynchronous-programming

Created 2020-02-27

594 commits to master branch, last one 7 days ago

dlrover intelligent-machine-learning

176

1.4k

other

44

DLRover: An Automatic Distributed Deep Learning System

k8s llm-training hacktoberfest distributed-training

Created 2022-06-24

2,976 commits to master branch, last one 3 days ago

gloo facebookincubator

324

1.3k

other

59

Collective communications library with various primitives for multi-machine training.

pytorch collectives distributed-training

Created 2017-02-03

502 commits to main branch, last one 22 hours ago

HyperPose tensorlayer

274

1.3k

unknown

57

Library for Fast and Flexible Human Pose Estimation

openpose tensorrt mobilenet tensorflow tensorlayer computer-vision neural-networks pose-estimation distributed-training

Created 2018-08-25

538 commits to master branch, last one 3 years ago

DeepRec DeepRec-AI

361

1.1k

apache-2.0

35

DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted as an incubation project at the LF AI & Data Foundation.

python advertising scalability deep-learning search-engine machine-learning distributed-training recommendation-engine

Created 2021-12-24

65,623 commits to main branch, last one 3 months ago

efficient-dl-systems mryab

132

814

mit

13

Efficient Deep Learning Systems course materials (HSE, YSDA)

cuda mlops pytorch deep-learning machine-learning ml-infrastructure distributed-training efficient-deep-learning

Created 2021-12-06

193 commits to main branch, last one 2 days ago

relora Guitaricet

38

452

apache-2.0

9

Official code for ReLoRA from the paper Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

nlp peft llama transformer deep-learning distributed-training

Created 2023-04-27

217 commits to main branch, last one about a year ago

adaptdl petuum

79

438

apache-2.0

10

Resource-adaptive cluster scheduler for deep learning training.

aws cloud pytorch kubernetes deep-learning machine-learning distributed-systems distributed-training

Created 2020-08-23

123 commits to master branch, last one 2 years ago

distributed-training-guide LambdaLabsML

30

405

mit

6

Best practices & guides on how to write distributed PyTorch training code

gpu mpi cuda fsdp nccl slurm cluster pytorch sharding deepspeed kuberentes lambdalabs gpu-cluster distributed-training

Created 2024-07-31

271 commits to main branch, last one 2 months ago

libai Oneflow-Inc

56

402

apache-2.0

41

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training

nlp oneflow large-scale transformer deep-learning data-parallelism model-parallelism vision-transformer distributed-training pipeline-parallelism self-supervised-learning

Created 2021-10-25

358 commits to main branch, last one 6 months ago

torchx pytorch

129

360

other

20

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.

ray slurm python airflow pytorch aws-batch pipelines components kubernetes deep-learning machine-learning distributed-training

Created 2021-05-04

802 commits to main branch, last one a day ago

HyperGBM DataCanvasIO

47

347

apache-2.0

15

A full pipeline AutoML tool for tabular data

gbm dask automl sklearn xgboost catboost lightgbm rapidsai datacleaning fullpipeline tabular-data preprocessing pseudo-labeling dask-distributed gpu-acceleration ensemble-learning distributed-training adversarial-validation semi-supervised-learning

Created 2020-10-22

772 commits to main branch, last one 10 days ago

oat sail-sg

22

331

apache-2.0

6

🌾 OAT: A research-friendly framework for LLM online alignment, including preference learning, reinforcement learning, etc.

dpo llm ppo grpo rlhf r1-zero alignment online-rl reasoning llm-aligment distributed-rl dueling-bandits llm-exploration online-alignment thompson-sampling distributed-training

Created 2024-10-15

36 commits to main branch, last one 8 days ago

KungFu lsds

59

294

apache-2.0

22

Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.

keras tensorflow distributed-systems distributed-training

Created 2018-12-29

384 commits to main branch, last one about a year ago

HandyRL DeNA

43

289

other

12

HandyRL is a handy and simple framework based on Python and PyTorch for distributed reinforcement learning that is applicable to your own environments.

games pytorch deep-learning policy-gradient machine-learning distributed-training reinforcement-learning

Created 2020-06-03

813 commits to master branch, last one 2 months ago

NanoDL HMUNACHI

10

286

mit

8

A Jax-based library for designing and training small transformers.

gpt jax nlp flax llama mistral attention transformer deep-learning machine-learning attention-mechanism distributed-training

Created 2023-08-22

158 commits to main branch, last one 8 months ago

awsome-distributed-training aws-samples

115

282

mit-0

14

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.

aws efa eks gpu awsbatch hyperpod llm-training generative-ai parallelcluster distributed-training

Created 2023-09-30

1,184 commits to main branch, last one 2 days ago

EasyParallelLibrary alibaba

49

267

apache-2.0

12

Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.

gpu deep-learning data-parallelism memory-efficient model-parallelism distributed-training pipeline-parallelism

Created 2022-02-23

21 commits to main branch, last one 2 years ago

fms-fsdp foundation-model-stack

38

240

apache-2.0

11

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.

llm pytorch distributed-training

Created 2024-02-05

362 commits to main branch, last one 2 months ago