15 results found

267
3.0k
mit
56
Fast inference engine for Transformer models
Created 2019-09-23
2,167 commits to master branch, last one 2 days ago
This repository has no description...
Created 2016-08-09
138 commits to master branch, last one 2 years ago
203
1.0k
apache-2.0
58
Tuned OpenCL BLAS
Created 2015-05-30
1,482 commits to master branch, last one 15 days ago
🎉 CUDA notes / hand-written CUDA kernels for large models / C++ notes, updated irregularly: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
Created 2022-12-17
110 commits to main branch, last one a day ago
97
455
unknown
16
BLISlab: A Sandbox for Optimizing GEMM
Created 2016-04-20
176 commits to master branch, last one 4 years ago
15
265
apache-2.0
16
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats a...
Created 2018-10-13
401 commits to master branch, last one 5 months ago
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
Created 2021-04-25
18 commits to main branch, last one 2 years ago
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
Created 2023-06-22
1 commit to master branch, last one 7 months ago
136
198
mit
55
Stretching GPU performance for GEMMs and tensor contractions.
Created 2015-11-05
5,439 commits to develop branch, last one 15 hours ago
45
134
gpl-2.0
20
DBCSR: Distributed Block Compressed Sparse Row matrix library
Created 2018-06-05
3,402 commits to develop branch, last one 24 hours ago
11
111
unknown
14
Single file libraries for C/C++
Created 2017-03-08
177 commits to master branch, last one 7 months ago
Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.
Created 2020-09-08
43 commits to master branch, last one 2 years ago
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
Created 2023-06-14
205 commits to main branch, last one a day ago
56
40
mit
14
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library
Created 2022-09-16
1,122 commits to develop branch, last one 23 hours ago
The simplest but fast implementation of matrix multiplication in CUDA.
Created 2024-04-05
14 commits to master branch, last one 13 days ago