Statistics for language Cuda
RepositoryStats tracks 584,790 GitHub repositories; of these, 341 report Cuda as their primary language.
Most starred repositories for language Cuda
Trending repositories for language Cuda
📚Modern CUDA Learn Notes with PyTorch: Tensor/CUDA Cores, 📖150+ CUDA Kernels with PyTorch bindings, 📖HGEMM/SGEMM (95%~99% cuBLAS performance), 📖100+ LLM/CUDA Blogs.
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
CUDA accelerated rasterization of gaussian splatting
Numerical experiments for the paper: "MPCGPU: Real-Time Nonlinear Model Predictive Control through Preconditioned Conjugate Gradient on the GPU"
Inference Speed Benchmark for Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).
Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks". This repository contains the code for the experiments in the paper.
Batch computation of the linear assignment problem on GPU.
Flash Attention in ~100 lines of CUDA (forward pass only)
A massively parallel, optimal functional runtime in Rust
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
Causal depthwise conv1d in CUDA, with a PyTorch interface
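The last entry provides a fused CUDA kernel for causal depthwise 1-D convolution. As a point of reference, the operation being accelerated can be sketched in plain Python; this sketch is illustrative only and is not taken from that repository:

```python
def causal_depthwise_conv1d(x, weights):
    """Causal depthwise 1-D convolution (reference sketch).

    x:       per-channel input sequences, shape [channels][time]
    weights: one kernel per channel,      shape [channels][kernel]

    "Depthwise" means each channel is convolved with its own kernel
    (no mixing across channels); "causal" means the output at time t
    depends only on inputs at times <= t, with implicit zero padding
    on the left.
    """
    out = []
    for ch, w in zip(x, weights):
        k = len(w)
        row = []
        for t in range(len(ch)):
            # Tap j reads input t - j; taps before t = 0 read zeros.
            acc = 0.0
            for j in range(k):
                if t - j >= 0:
                    acc += w[j] * ch[t - j]
            row.append(acc)
        out.append(row)
    return out
```

A CUDA implementation typically parallelizes over (channel, time) pairs, which is what makes the depthwise form attractive on GPU: every output element is independent of the others.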