Statistics for language Cuda
RepositoryStats tracks 579,556 GitHub repositories; of these, 335 are reported to use Cuda as their primary language.
Most starred repositories for language Cuda
Trending repositories for language Cuda
Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks", containing the code for the experiments in the paper.
A throughput-oriented high-performance serving framework for LLMs
Modified 3D Gaussian rasterizer for latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction
PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU.
Batch computation of the linear assignment problem on GPU.
LightwheelOcc: A 3D Occupancy Synthetic Dataset in Autonomous Driving
Examples and exercises from the book Programming Massively Parallel Processors: A Hands-on Approach by David B. Kirk and Wen-mei W. Hwu (Third Edition).
CUDA accelerated rasterization of gaussian splatting
🎉 Modern CUDA Learn Notes with PyTorch: CUDA Cores, Tensor Cores, fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, hgemm, sgemv, warp/block reduce, elementwise, softmax, layernorm, rmsnorm.
Flash Attention in ~100 lines of CUDA (forward pass only)
A massively parallel, optimal functional runtime in Rust
Differentiable Gaussian rasterization with depth, alpha, normal map and extra per-Gaussian attributes; also supports camera pose gradients.
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
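
The "CUDA Learn Notes" entry above lists warp/block reduce among its topics. Below is a minimal sketch of that pattern, not code from any repository listed here: a warp-level sum using shuffle intrinsics, a block-level sum built on top of it, and a grid-stride kernel that accumulates the result with atomicAdd. The helper names and the launch configuration are illustrative assumptions.

// Illustrative warp/block reduction sketch (not taken from any repository listed above).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Reduce a value across the 32 lanes of a warp with shuffle intrinsics.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp's sum
}

// Reduce a value across a thread block: per-warp reductions, partial sums
// staged in shared memory, then a final reduction by the first warp.
__device__ float block_reduce_sum(float val) {
    __shared__ float partial[32];              // one slot per warp (max 32 warps/block)
    const int lane = threadIdx.x % 32;
    const int warp = threadIdx.x / 32;

    val = warp_reduce_sum(val);
    if (lane == 0) partial[warp] = val;
    __syncthreads();

    const int num_warps = (blockDim.x + 31) / 32;
    val = (threadIdx.x < num_warps) ? partial[lane] : 0.0f;
    if (warp == 0) val = warp_reduce_sum(val);
    return val;                                // thread 0 holds the block's sum
}

// Grid-stride sum of an array: each block reduces its share and
// atomically adds the partial result into *out.
__global__ void reduce_sum(const float* in, float* out, int n) {
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        v += in[i];
    v = block_reduce_sum(v);
    if (threadIdx.x == 0) atomicAdd(out, v);
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);          // all ones, so the expected sum is n

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    reduce_sum<<<256, 256>>>(d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expected %d)\n", result, n);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}

Compiled with nvcc, the program sums 2^20 ones, so the expected output is 1048576; the same block_reduce_sum building block is what kernels such as softmax, layernorm and rmsnorm typically reuse for their row-wise sums.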