Statistics for language Cuda
RepositoryStats tracks 596,217 GitHub repositories; of these, 355 are reported to use a primary language of Cuda.
Most starred repositories for language Cuda
Trending repositories for language Cuda
📚150+ Tensor/CUDA Cores Kernels, ⚡️flash-attention-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS 🎉🎉).
CUDA accelerated rasterization of gaussian splatting
A contact solver for physics-based simulations involving 👚 shells, 🪵 solids and 🪢 rods.
C++ implementation of "ALIKED: A Lighter Keypoint and Descriptor Extraction Network via Deformable Transformation"
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
State of the art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.
Flash Attention in ~100 lines of CUDA (forward pass only)
A massively parallel, optimal functional runtime in Rust
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).