Trending repositories for language CUDA
CUDA-accelerated rasterization of Gaussian splatting
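At the heart of such a rasterizer is front-to-back alpha compositing of depth-sorted Gaussians per pixel. A minimal sketch of that inner loop, assuming the per-pixel alpha and color contributions of each splat have already been computed (the names here are illustrative, not this repository's API):

```cuda
// Front-to-back alpha compositing over depth-sorted splats for one
// pixel. `alpha` and `rgb` are assumed precomputed per (pixel, splat)
// pair; the loop exits early once transmittance is negligible.
__device__ void composite_pixel(const float* alpha, const float3* rgb,
                                int n, float3* out) {
    float T = 1.0f;                            // remaining transmittance
    float3 c = make_float3(0.0f, 0.0f, 0.0f);  // accumulated color
    for (int i = 0; i < n && T > 1e-4f; ++i) {
        float a = alpha[i];
        c.x += T * a * rgb[i].x;
        c.y += T * a * rgb[i].y;
        c.z += T * a * rgb[i].z;
        T *= 1.0f - a;                         // what's behind is occluded
    }
    *out = c;
}
```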
🎉 Modern CUDA Learn Notes with PyTorch: CUDA Cores, Tensor Cores, fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, hgemm, sgemv, warp/block reduce, elementwise, softmax, layernorm, rmsnorm.
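The warp/block reduce primitive listed there underpins the softmax, layernorm, and rmsnorm kernels; a minimal runnable sketch of a warp-level sum reduction, assuming a single full 32-lane warp:

```cuda
#include <cstdio>

// Butterfly reduction: after the loop, every lane of the 32-lane warp
// holds the sum of all lanes' inputs.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_xor_sync(0xffffffff, val, offset);
    return val;
}

__global__ void sum32(const float* in, float* out) {
    float v = warp_reduce_sum(in[threadIdx.x]);
    if (threadIdx.x == 0) *out = v;  // lane 0 publishes the total
}

int main() {
    float h[32], result, *d_in, *d_out;
    for (int i = 0; i < 32; ++i) h[i] = 1.0f;  // expected sum: 32
    cudaMalloc(&d_in, sizeof(h));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h, sizeof(h), cudaMemcpyHostToDevice);
    sum32<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %.1f\n", result);
    return 0;
}
```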
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
A throughput-oriented high-performance serving framework for LLMs
Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks"; contains the code for the paper's experiments.
A comprehensive library of 3D transmission computed tomography (CT) algorithms with Python and C++ APIs and a PyQt GUI, fully integrated with PyTorch
Modified 3D Gaussian rasterizer for latentSplat: Autoencoding Variational Gaussians for Fast Generalizable 3D Reconstruction
PyTorch half-precision GEMM library with optionally fused bias and optional ReLU/GELU
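Fusing the bias add and activation into the GEMM epilogue saves a round trip through global memory. A deliberately naive sketch of the idea (one thread per output element, fp32 accumulation; a real library would tile through shared memory or tensor cores, but the fused epilogue works the same way):

```cuda
#include <cuda_fp16.h>

// Naive half-precision GEMM with a fused bias + ReLU epilogue:
// C[m][n] = relu(sum_k A[m][k] * B[k][n] + bias[n]).
__global__ void hgemm_bias_relu(const __half* A, const __half* B,
                                const __half* bias, __half* C,
                                int M, int N, int K) {
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= M || n >= N) return;
    float acc = 0.0f;                          // accumulate in fp32
    for (int k = 0; k < K; ++k)
        acc += __half2float(A[m * K + k]) * __half2float(B[k * N + n]);
    acc += __half2float(bias[n]);              // fused bias
    C[m * N + n] = __float2half(fmaxf(acc, 0.0f));  // fused ReLU
}
```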
Batch computation of the linear assignment problem on GPU.
LightwheelOcc: A 3D Occupancy Synthetic Dataset in Autonomous Driving
High-speed GEMV kernels with up to 2.7x speedup over the PyTorch baseline.
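GEMV is memory-bound, so fast kernels mostly come down to coalesced row reads plus a warp reduction. A minimal warp-per-row sketch of that pattern (illustrative, not this repository's code):

```cuda
// One warp per output row of y = A·x: lanes stride across the row's
// K elements (coalesced loads), then a shuffle reduction combines the
// partial dot products. blockDim.x is assumed a multiple of 32.
__global__ void sgemv_warp_per_row(const float* A, const float* x,
                                   float* y, int M, int K) {
    int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= M) return;
    float acc = 0.0f;
    for (int k = lane; k < K; k += 32)
        acc += A[row * K + k] * x[k];
    for (int offset = 16; offset > 0; offset >>= 1)
        acc += __shfl_xor_sync(0xffffffff, acc, offset);
    if (lane == 0) y[row] = acc;
}
```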
Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
Flash Attention in ~100 lines of CUDA (forward pass only)
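The forward pass hinges on the online (streaming) softmax trick: scores are folded in one at a time while the running accumulators are rescaled, so the full attention matrix is never materialized. A scalar sketch of that update rule (the real kernel applies it per tile, with values held in registers):

```cuda
#include <cmath>

// Streaming ("online") softmax: compute dot(softmax(s), v) one score
// at a time, rescaling the accumulators whenever the running max grows.
struct OnlineSoftmax {
    float m = -INFINITY;  // running max of scores seen so far
    float l = 0.0f;       // running sum of exp(s_i - m)
    float o = 0.0f;       // running sum of exp(s_i - m) * v_i

    __device__ void update(float s, float v) {
        float m_new = fmaxf(m, s);
        float scale = expf(m - m_new);  // rescale old accumulators
        float p     = expf(s - m_new);  // weight of the new score
        l = l * scale + p;
        o = o * scale + p * v;
        m = m_new;
    }
    __device__ float finish() const { return o / l; }
};

// Single-thread demo: *out = dot(softmax(s), v) over n scalar scores.
__global__ void attend_1d(const float* s, const float* v, int n,
                          float* out) {
    OnlineSoftmax acc;
    for (int i = 0; i < n; ++i) acc.update(s[i], v[i]);
    *out = acc.finish();
}
```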
Templated C++/CUDA implementation of Model Predictive Path Integral Control (MPPI)
A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more!
Speed up image preprocessing with CUDA when handling images or running TensorRT inference
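Typical TensorRT-style preprocessing fuses layout conversion and normalization into a single kernel. A minimal sketch, assuming uint8 HWC BGR input and float CHW RGB output scaled to [0, 1] (mean/std normalization would be folded in the same way):

```cuda
// HWC uint8 BGR -> CHW float RGB in one pass, one thread per pixel.
__global__ void preprocess(const unsigned char* src, float* dst,
                           int H, int W) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    int i = (y * W + x) * 3;              // HWC index, BGR order
    float b = src[i + 0], g = src[i + 1], r = src[i + 2];
    int plane = H * W, j = y * W + x;
    dst[0 * plane + j] = r / 255.0f;      // CHW, RGB order
    dst[1 * plane + j] = g / 255.0f;
    dst[2 * plane + j] = b / 255.0f;
}
```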
🎉 [CVPR 2024] PyTorch implementation of 'How Far Can We Compress Instant-NGP Based NeRF?'
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
From zero to hero: CUDA for accelerating math and machine learning on the GPU.
State of the art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.
A massively parallel, optimal functional runtime in Rust
A series of GPU optimization topics explaining in detail how to optimize CUDA kernels, covering several basic kernel optimizations, including elementwise, reduce, sg...
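The elementwise case such series usually start from is an add kernel with a grid-stride loop, so a fixed launch configuration covers any problem size:

```cuda
// Elementwise add with a grid-stride loop: any grid covers any n.
__global__ void add(const float* a, const float* b, float* c, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        c[i] = a[i] + b[i];
}
```

Launched as, e.g., `add<<<256, 256>>>(a, b, c, n)`, the same grid handles any n, which is why the pattern is the usual baseline before memory-access optimizations.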
Differentiable Gaussian rasterization with depth, alpha, normal map, and extra per-Gaussian attributes; also supports camera pose gradients
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
Causal depthwise conv1d in CUDA, with a PyTorch interface
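A causal depthwise conv1d gives each channel its own short filter and lets output position t read only inputs at positions ≤ t. A minimal sketch with one thread per output element, assuming (batch, channel, length) layout and implicit left zero-padding (illustrative, not this repository's kernel):

```cuda
// Causal depthwise conv1d: y[b][c][t] = sum_k x[b][c][t-(KW-1)+k] * w[c][k],
// with out-of-range (negative) input positions treated as zero.
__global__ void causal_dwconv1d(const float* x, const float* w,
                                float* y, int B, int C, int L, int KW) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= B * C * L) return;
    int t = idx % L;             // position within the sequence
    int c = (idx / L) % C;       // channel index (selects the filter)
    float acc = 0.0f;
    for (int k = 0; k < KW; ++k) {
        int src = t - (KW - 1) + k;         // causal: look back only
        if (src >= 0)
            acc += x[idx - t + src] * w[c * KW + k];
    }
    y[idx] = acc;
}
```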
llama3.cuda is a pure C/CUDA implementation of the Llama 3 model.
PhantomFHE: A CUDA-Accelerated Homomorphic Encryption Library