Trending repositories for language Cuda
📚 150+ Tensor/CUDA Core kernels, ⚡️flash-attention-mma, ⚡️hgemm with WMMA, MMA, and CuTe (98%~100% of cuBLAS TFLOPS 🎉🎉).
CUDA-accelerated rasterization of Gaussian splatting
A throughput-oriented high-performance serving framework for LLMs
A contact solver for physics-based simulations involving 👚 shells, 🪵 solids and 🪢 rods.
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
[ECCV'24] On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy
A series of GPU optimization topics that introduces, in detail, how to optimize CUDA kernels, covering several basic kernel optimizations, including: elementwise, reduce, sg...
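For flavor, here is a minimal sketch, written for this listing rather than taken from that repo, of the shared-memory tree reduction such kernel-optimization series typically start from (all names are illustrative):

```cuda
#include <cstdio>

// Minimal block-level sum reduction: each block reduces BLOCK elements
// of `in` into one partial sum in `out[blockIdx.x]`. A sketch only; the
// tutorial series above goes on to warp shuffles, vectorized loads, etc.
constexpr int BLOCK = 256;

__global__ void reduce_sum(const float* in, float* out, int n) {
    __shared__ float smem[BLOCK];
    int i = blockIdx.x * BLOCK + threadIdx.x;
    smem[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    // Tree reduction in shared memory.
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            smem[threadIdx.x] += smem[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = smem[0];
}
```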
State-of-the-art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity-style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.
Causal depthwise conv1d in CUDA, with a PyTorch interface
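As a hedged illustration of what a causal depthwise conv1d computes, here is a naive reference kernel, one thread per output element, assuming a contiguous (B, C, T) layout; this is not the repo's code, which is far more optimized:

```cuda
// Naive causal depthwise conv1d: for each (batch b, channel c, time t),
// y[b][c][t] = sum_k w[c][k] * x[b][c][t - (K-1) + k], zero-padded on the
// left so no future timestep is read. Layout assumed (B, C, T) contiguous.
__global__ void causal_dwconv1d(const float* x, const float* w, float* y,
                                int B, int C, int T, int K) {
    long idx = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (idx >= (long)B * C * T) return;
    int t = idx % T;
    int c = (idx / T) % C;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        int src = t - (K - 1) + k;              // causal: past/current samples only
        if (src >= 0) acc += w[c * K + k] * x[idx + (src - t)];
    }
    y[idx] = acc;
}
```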
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA PTX and CuTe API (Write for Fun 👀~)
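To give a sense of the WMMA starting point such an HGEMM write-up builds from, here is a minimal sketch (not the repo's code): one warp computing one 16x16 tile of C, assuming row-major half A, column-major half B, float C, and M, N, K all multiples of 16:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of C = A * B with Tensor Cores.
// Requires sm_70+. Launch with 32 threads per block, grid dim3(N/16, M/16).
__global__ void hgemm_wmma_16x16(const half* A, const half* B, float* C,
                                 int M, int N, int K) {
    int tileM = blockIdx.y * 16, tileN = blockIdx.x * 16;
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a, A + tileM * K + k, K);  // lda = K (row-major A)
        wmma::load_matrix_sync(b, B + tileN * K + k, K);  // ldb = K (col-major B)
        wmma::mma_sync(acc, a, b, acc);
    }
    wmma::store_matrix_sync(C + tileM * N + tileN, acc, N, wmma::mem_row_major);
}
```

From here, the usual HGEMM progression is shared-memory staging, double buffering, and MMA PTX or CuTe for finer control, which is what the repo above walks through.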
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Code implementations for Professional CUDA C Programming (《CUDA C 编程权威指南》), covering most of the code from chapters 2 through 8 along with the author's notes, all implemented by hand by the author. Errors are hard to avoid, so please refer to it with care; corrections are very welcome. If it helps you, please give it a Star, it means a lot to the author. Thanks!
C++ implementation of "ALIKED: A Lighter Keypoint and Descriptor Extraction Network via Deformable Transformation"
Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks", containing the code for the experiments in the paper.
Numerical experiments for the paper: "MPCGPU: Real-Time Nonlinear Model Predictive Control through Preconditioned Conjugate Gradient on the GPU"
The modified differential Gaussian rasterization used in the CVPR 2024 highlight paper "GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting".
Templated C++/CUDA implementation of Model Predictive Path Integral Control (MPPI)
HEonGPU is a high-performance library that optimizes Fully Homomorphic Encryption (FHE) on GPUs. Leveraging GPU parallelism, it reduces computational load through concurrent execution. Its multi-strea...
Flash Attention in ~100 lines of CUDA (forward pass only)
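The heart of any such ~100-line kernel is the online-softmax recurrence. Here is a scalar sketch of it, assuming head dimension 1 and one thread for clarity (illustrative names only, not the repo's code): the running max, normalizer, and output are rescaled as scores stream in, so the softmax never needs a second pass.

```cuda
// Online-softmax accumulation for one query row: stream scores s[j] and
// values v[j] once, keeping a running max m, normalizer l, and
// unnormalized output o, rescaling whenever the max grows. This is the
// recurrence that flash-attention kernels tile across shared memory.
__device__ float attend_one_row(const float* s, const float* v, int n) {
    float m = -INFINITY, l = 0.0f, o = 0.0f;
    for (int j = 0; j < n; ++j) {
        float m_new = fmaxf(m, s[j]);
        float scale = __expf(m - m_new);   // rescale old accumulators
        float p = __expf(s[j] - m_new);    // weight of the new score
        l = l * scale + p;
        o = o * scale + p * v[j];
        m = m_new;
    }
    return o / l;                          // equals softmax(s) . v
}
```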
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
From zero to hero: CUDA for accelerating maths and machine learning on the GPU.
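In the spirit of such a from-zero course, the canonical first kernel is an elementwise SAXPY; a self-contained sketch, assuming unified memory and the default stream:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The classic first CUDA kernel: elementwise SAXPY, y = a*x + y.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();
    printf("y[0] = %f (expect 5.0)\n", y[0]);   // 3*1 + 2
    cudaFree(x); cudaFree(y);
    return 0;
}
```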
A massively parallel, optimal functional runtime in Rust
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
llama3.cuda is a pure C/CUDA implementation of the Llama 3 model.
PhantomFHE: A CUDA-Accelerated Homomorphic Encryption Library