21 results found

338
3.7k
mit
59
Fast inference engine for Transformer models
Created 2019-09-23
2,191 commits to master branch, last one 21 days ago
297
2.9k
gpl-3.0
22
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
Created 2022-12-17
505 commits to main branch, last one 2 days ago
This repository has no description...
Created 2016-08-09
138 commits to master branch, last one 2 years ago
204
1.1k
apache-2.0
57
Tuned OpenCL BLAS
Created 2015-05-30
1,483 commits to master branch, last one 4 months ago
107
506
unknown
15
BLISlab: A Sandbox for Optimizing GEMM
Created 2016-04-20
176 commits to master branch, last one 5 years ago
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
Created 2023-06-22
1 commit to master branch, last one 6 months ago
Multi-Threaded FP32 Matrix Multiplication on x86 CPUs
Created 2024-07-01
90 commits to main branch, last one 26 days ago
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
Created 2021-04-25
20 commits to main branch, last one 2 months ago
14
285
apache-2.0
13
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats a...
Created 2018-10-13
401 commits to master branch, last one about a year ago
158
233
mit
55
Stretching GPU performance for GEMMs and tensor contractions.
Created 2015-11-05
5,574 commits to develop branch, last one 5 days ago
🔥🔥🔥 A collection of some awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR and High Performance Computing (HPC) projects.
Created 2023-02-23
31 commits to main branch, last one 2 days ago
48
140
gpl-2.0
19
DBCSR: Distributed Block Compressed Sparse Row matrix library
Created 2018-06-05
3,497 commits to develop branch, last one 4 days ago
Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.
Created 2020-09-08
43 commits to master branch, last one 3 years ago
11
120
unknown
13
Single file libraries for C/C++
Created 2017-03-08
178 commits to master branch, last one 7 months ago
111
81
mit
15
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library
Created 2022-09-16
1,822 commits to develop branch, last one a day ago
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
Created 2023-10-09
1 commit to master branch, last one 6 months ago
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
Created 2023-06-14
245 commits to main branch, last one about a month ago
PyTorch half precision gemm lib w/ fused optional bias + optional relu/gelu
Created 2024-04-10
14 commits to master branch, last one 3 months ago
A Flexible and Energy Efficient Accelerator For Sparse Convolution Neural Network
Created 2024-06-06
93 commits to master branch, last one 24 days ago
Serial and parallel implementations of matrix multiplication
Created 2020-06-21
60 commits to master branch, last one 4 years ago
The simplest but fast implementation of matrix multiplication in CUDA.
Created 2024-04-05
33 commits to master branch, last one 7 months ago