6 results found

300
2.9k
gpl-3.0
22
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
Created 2022-12-17
506 commits to main branch, last one 7 hours ago
47
776
apache-2.0
13
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
Created 2024-03-01
31 commits to main branch, last one a day ago
🔥🔥🔥 A collection of some awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR and High Performance Computing (HPC) projects.
Created 2023-02-23
31 commits to main branch, last one 3 days ago
Examples of CUDA implementations using CUTLASS CuTe
Created 2024-04-28
29 commits to main branch, last one about a month ago
4
42
bsd-3-clause
1
CUTLASS and CuTe Examples
Created 2024-07-29
166 commits to main branch, last one 2 months ago
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
Created 2023-08-16
1 commit to master branch, last one 20 days ago