4 results found
📚150+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
Created 2022-12-17
429 commits to main branch, last one 22 hours ago
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
Created 2023-06-22
1 commit to master branch, last one 3 months ago
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
Created 2023-10-09
1 commit to master branch, last one 3 months ago
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA PTX and CuTe API (Write for Fun 👀~)
Created 2024-11-30
34 commits to main branch, last one 23 days ago