Statistics for language Cuda
RepositoryStats tracks 640,578 GitHub repositories; 395 of these report Cuda as their primary language.
Most starred repositories for language Cuda

[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
📚Modern CUDA Learn Notes: 200+ Tensor/CUDA Cores Kernels🎉, HGEMM, FA2 via MMA and CuTe, 98~100% TFLOPS of cuBLAS/FA2.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
The modified differential Gaussian rasterization from the CVPR 2024 highlight paper GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting

Trending repositories for language Cuda

This package contains the original 2012 AlexNet code.
DDN-SLAM: Real-time Dense Dynamic Neural Implicit SLAM (RA-L 2025)
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
Quantized attention that achieves speedups of 2.1-3.1x and 2.7-5.1x over FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models
A massively parallel, optimal functional runtime in Rust
llama3.cuda is a pure C/CUDA implementation of the Llama 3 model.