Trending repositories for language CUDA
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
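The core mechanic behind such quantized-attention schemes is cheap on-GPU low-bit quantization of the attention tiles before the matmul. Below is a minimal sketch of per-row symmetric INT8 quantization, assuming FP16 input, one 256-thread block per row, and illustrative names throughout (none taken from the repository):

    // Per-row symmetric INT8 quantization: scale = max|x| / 127.
    // Illustrative sketch; launch with one 256-thread block per row.
    #include <cuda_fp16.h>
    #include <stdint.h>

    __global__ void quantize_rows_int8(const half* __restrict__ x,
                                       int8_t* __restrict__ q,
                                       float* __restrict__ scale,
                                       int cols) {
        int row = blockIdx.x;
        __shared__ float smax[256];
        // 1) block-wide reduction of max|x| over the row
        float m = 0.f;
        for (int c = threadIdx.x; c < cols; c += blockDim.x)
            m = fmaxf(m, fabsf(__half2float(x[row * cols + c])));
        smax[threadIdx.x] = m;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                smax[threadIdx.x] = fmaxf(smax[threadIdx.x], smax[threadIdx.x + s]);
            __syncthreads();
        }
        float sc = smax[0] / 127.f + 1e-8f;   // avoid divide-by-zero on all-zero rows
        if (threadIdx.x == 0) scale[row] = sc;
        // 2) quantize: round-to-nearest into int8
        for (int c = threadIdx.x; c < cols; c += blockDim.x)
            q[row * cols + c] = (int8_t)rintf(__half2float(x[row * cols + c]) / sc);
    }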
[ICLR2025] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Flash Attention in ~100 lines of CUDA (forward pass only)
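What fits in ~100 lines is essentially the online-softmax recurrence: stream over the keys once, keep a running max m and normalizer l, and rescale the partial output whenever the max moves. A toy sketch of that recurrence, assuming one thread per query row and a fixed head dimension of 64 (layout and names are illustrative, not the repository's kernel):

    // Online softmax in one pass over the keys: O(1) extra memory per query.
    // Toy layout: one thread per query row; HEAD_DIM fixed at 64 (an assumption).
    #include <math.h>
    #define HEAD_DIM 64

    __global__ void attention_forward_naive(const float* Q, const float* K,
                                            const float* V, float* O,
                                            int n, float softmax_scale) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // query row
        if (i >= n) return;
        float m = -INFINITY, l = 0.f, acc[HEAD_DIM] = {0.f};
        for (int j = 0; j < n; ++j) {                    // stream over keys once
            float s = 0.f;
            for (int d = 0; d < HEAD_DIM; ++d)
                s += Q[i * HEAD_DIM + d] * K[j * HEAD_DIM + d];
            s *= softmax_scale;
            float m_new = fmaxf(m, s);
            float corr = expf(m - m_new);                // rescale older contributions
            float p = expf(s - m_new);
            l = l * corr + p;                            // running normalizer
            for (int d = 0; d < HEAD_DIM; ++d)
                acc[d] = acc[d] * corr + p * V[j * HEAD_DIM + d];
            m = m_new;
        }
        for (int d = 0; d < HEAD_DIM; ++d)
            O[i * HEAD_DIM + d] = acc[d] / l;            // final normalization
    }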
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high-performance applications.
HEonGPU is a high-performance library that optimizes Fully Homomorphic Encryption (FHE) on GPUs. Leveraging GPU parallelism, it reduces computational load through concurrent execution. Its multi-stream architecture allows independent operations to run in parallel.
TileFusion is a highly efficient kernel template library designed to elevate the level of abstraction in CUDA C for processing tiles.
A comparison of array languages & libraries: APL, J, BQN, Uiua, Q, Julia, R, NumPy, Nial, Futhark, Dex, Ivy, SaC & ArrayFire.
Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions.
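The WMMA path boils down to three intrinsics per 16x16x16 tile: load the fragments, mma_sync on the tensor cores, store the accumulator. A minimal sketch assuming row-major FP16 inputs, an FP32 accumulator, and M, N, K all multiples of 16 (requires sm_70+; not the repository's tuned kernels):

    // One warp computes one 16x16x16 tile of C = A*B on tensor cores.
    // Assumes row-major half A (MxK) and B (KxN), float C, M/N/K multiples of 16.
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __global__ void hgemm_wmma(const half* A, const half* B, float* C,
                               int M, int N, int K) {
        int tileM = blockIdx.y * 16;
        int tileN = blockIdx.x * 16;
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
        wmma::fill_fragment(c, 0.0f);
        for (int k = 0; k < K; k += 16) {
            wmma::load_matrix_sync(a, A + tileM * K + k, K);  // lda = K
            wmma::load_matrix_sync(b, B + k * N + tileN, N);  // ldb = N
            wmma::mma_sync(c, a, b, c);                       // tensor-core FMA
        }
        wmma::store_matrix_sync(C + tileM * N + tileN, c, N, wmma::mem_row_major);
    }
    // Launch one warp per tile: hgemm_wmma<<<dim3(N/16, M/16), 32>>>(A, B, C, M, N, K);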
📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️GPU SRAM complexity for headdim > 256, 1.8x~3x↑🎉faster vs SDPA EA.
Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks", containing the code for the paper's experiments.
A throughput-oriented high-performance serving framework for LLMs
MD5 hash cracking with CUDA and Rust, implemented from scratch
Inference Speed Benchmark for Learning to (Learn at Test Time): RNNs with Expressive Hidden States
State of the art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.
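Chained-scan radix sorts such as OneSweep lean on fast warp-level prefix sums for digit counting. A small sketch of the shuffle-based inclusive scan such sorts build on (a generic building block, not the repository's code):

    // Warp-wide inclusive prefix sum via shuffles; all 32 lanes must be active.
    __device__ unsigned warp_inclusive_scan(unsigned v) {
        unsigned lane = threadIdx.x & 31;
        for (int offset = 1; offset < 32; offset <<= 1) {
            unsigned up = __shfl_up_sync(0xffffffffu, v, offset);
            if (lane >= offset) v += up;   // add the value held 'offset' lanes below
        }
        return v;
    }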
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
From zero to hero: CUDA for accelerating maths and machine learning on the GPU.
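The canonical first step in any such from-zero course is a vector add: allocate, launch a grid-stride kernel, synchronize, check the result. A minimal sketch using unified memory to keep the host code short (names are illustrative):

    // "Hello GPU": grid-stride vector add with the standard allocate/launch flow.
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void vec_add(const float* a, const float* b, float* c, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));  // unified memory keeps the demo short
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.f; b[i] = 2.f; }
        vec_add<<<256, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();
        printf("c[0] = %f\n", c[0]);               // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
    }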
[ECCV'24] On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy
The modified differential Gaussian rasterization in the CVPR 2024 highlight paper: GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting.
A massively parallel, optimal functional runtime in Rust
llama3.cuda is a pure C/CUDA implementation of the Llama 3 model.
Differential Gaussian Rasterization with Depth forward and backward functionality
PhantomFHE: A CUDA-Accelerated Homomorphic Encryption Library