5 results found

22
261
apache-2.0
10
A fast communication-overlapping library for tensor parallelism on GPUs.
Created 2024-03-01
24 commits to main branch, last one 2 months ago
🔥🔥🔥 A collection of some awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR and High Performance Computing (HPC) projects.
Created 2023-02-23
20 commits to main branch, last one 7 days ago
Examples of CUDA implementations by CUTLASS CuTe
Created 2024-04-28
26 commits to main branch, last one about a month ago
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
Created 2023-08-16
1 commit to master branch, last one 4 months ago
1
30
bsd-3-clause
1
CUTLASS and CuTe Examples
Created 2024-07-29
166 commits to main branch, last one 2 days ago