15 results found
- Filter by Primary Language:
- Cuda (5)
- C (4)
- C++ (2)
- Assembly (1)
- Fortran (1)
- Nim (1)
- Python (1)
Fast inference engine for Transformer models
Created
2019-09-23
2,167 commits to master branch, last one 2 days ago
This repository has no description...
Created
2016-08-09
138 commits to master branch, last one 2 years ago
Tuned OpenCL BLAS
Created
2015-05-30
1,482 commits to master branch, last one 15 days ago
🎉 CUDA notes / hand-written CUDA for large models / C++ notes, updated sporadically: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
Created
2022-12-17
110 commits to main branch, last one a day ago
BLISlab: A Sandbox for Optimizing GEMM
Created
2016-04-20
176 commits to master branch, last one 4 years ago
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats a...
Created
2018-10-13
401 commits to master branch, last one 5 months ago
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
Created
2021-04-25
18 commits to main branch, last one 2 years ago
Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
Created
2023-06-22
1 commit to master branch, last one 7 months ago
Stretching GPU performance for GEMMs and tensor contractions.
Created
2015-11-05
5,439 commits to develop branch, last one 15 hours ago
DBCSR: Distributed Block Compressed Sparse Row matrix library
Created
2018-06-05
3,402 commits to develop branch, last one 24 hours ago
Single file libraries for C/C++
Created
2017-03-08
177 commits to master branch, last one 7 months ago
Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.
Created
2020-09-08
43 commits to master branch, last one 2 years ago
FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
Created
2023-06-14
205 commits to main branch, last one a day ago
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library
Created
2022-09-16
1,122 commits to develop branch, last one 23 hours ago
The simplest but fast implementation of matrix multiplication in CUDA.
Created
2024-04-05
14 commits to master branch, last one 13 days ago