Trending repositories for topic cuda
SCUDA is a GPU over IP bridge allowing GPUs on remote machines to be attached to CPU-only machines.
A high-throughput and memory-efficient inference and serving engine for LLMs
A highly optimized LLM inference acceleration engine for Llama and its variants.
OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
SGLang is a fast serving framework for large language models and vision language models.
Samples for CUDA developers demonstrating features in the CUDA Toolkit
📚150+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization i...
A Rust library integrated with ONNX Runtime, providing a collection of computer vision and vision-language models.
DeepStream Libraries offer CVCUDA, NvImageCodec, and PyNvVideoCodec modules as Python APIs for seamless integration into custom frameworks.
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!)
A fast communication-overlapping library for tensor parallelism on GPUs.
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
A nearly complete collection of prefix sum algorithms implemented in CUDA, D3D12, Unity and WGPU. Theoretically portable to all wave/warp/subgroup sizes.
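For reference, the operation these kernels implement can be sketched in a few lines of Python — a sequential sketch of an exclusive prefix sum, not the repository's parallel GPU code:

```python
def exclusive_prefix_sum(xs):
    """Exclusive scan: out[i] = sum of xs[0..i-1], with out[0] = 0.
    Parallel GPU variants (e.g. decoupled look-back) produce the
    same result, just computed across warps/waves instead of a loop."""
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out

# exclusive_prefix_sum([3, 1, 4, 1, 5]) -> [0, 3, 4, 8, 9]
```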
A CUDA reimplementation of the line/plane odometry of LIO-SAM. A point cloud hash map (inspired by iVox of Faster-LIO) on GPU is used to accelerate 5-neighbour KNN search.
(WIP) A small but powerful, homemade PyTorch from scratch.
An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)🎉 GPU SRAM complexity for headdim > 256, 1.5x~2x🎉 faster vs SDPA EA.
SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving
Code for "EnvGS: Modeling View-Dependent Appearance with Environment Gaussian", arXiv 2024. Including a fully differentiable 2D Gaussian ray tracer built on 2DGS and OptiX, supporting multiple-bounce ...
Unbiased & physically-based GPU HIPRT (C++/HIP) interactive path tracing renderer
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, achieve peak⚡️ performance
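What an HGEMM kernel computes is just a dense matrix product; a plain-Python reference (a sketch, not the repository's tensor-core code) makes the target result explicit:

```python
def gemm_reference(A, B):
    """Triple-loop GEMM: C[i][j] = sum over p of A[i][p] * B[p][j].
    Tensor-core HGEMM kernels (WMMA/MMA/CuTe) compute this same product,
    typically with fp16 inputs and fp32 accumulation, tile by tile."""
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a = A[i][p]
            for j in range(n):
                C[i][j] += a * B[p][j]
    return C
```

A correctness check like this is what kernel authors typically diff their GPU output against (within fp16 tolerance).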
Multi-platform high-performance compute language extension for Rust.
Best practices & guides on how to write distributed pytorch training code
State of the art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.
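The core of any LSD radix sort is a stable per-digit pass; a minimal Python sketch (sequential, not the repository's OneSweep GPU implementation) shows the idea:

```python
def radix_sort(keys, bits=32, radix_bits=8):
    """Stable least-significant-digit radix sort over non-negative ints:
    one stable bucketing pass per 8-bit digit. GPU variants such as
    OneSweep fuse the histogram, prefix sum, and scatter of each pass."""
    mask = (1 << radix_bits) - 1
    for shift in range(0, bits, radix_bits):
        buckets = [[] for _ in range(1 << radix_bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)  # stable scatter
        keys = [k for b in buckets for k in b]
    return keys
```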
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
A model deployment white paper (CUDA | ONNX | TensorRT | C++) 🚀🚀🚀
NviWatch: a blazingly fast Rust-based TUI for managing and monitoring NVIDIA GPU processes
From zero to hero: CUDA for accelerating maths and machine learning on GPUs.
3DGS-LM accelerates Gaussian-Splatting optimization by replacing the ADAM optimizer with Levenberg-Marquardt.
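The Levenberg-Marquardt update that replaces ADAM here is a damped Gauss-Newton step; a one-parameter Python sketch (illustrative only, far from the repository's large-scale GPU solver) shows the shape of it:

```python
def lm_step(residual, jacobian, x, lam):
    """One Levenberg-Marquardt update for min ||r(x)||^2:
    solve (J^T J + lam) dx = -J^T r, then return x + dx.
    Sketch for a single scalar parameter; 3DGS-LM applies the same
    idea to millions of Gaussian-splat parameters on the GPU."""
    r = residual(x)          # residual vector at x
    J = jacobian(x)          # d r_i / d x at x
    jtj = sum(j * j for j in J)
    jtr = sum(j * ri for j, ri in zip(J, r))
    dx = -jtr / (jtj + lam)  # damped normal-equations solve
    return x + dx

# Toy fit: r_i(x) = x - y_i. With lam = 0 one step lands on the mean of y.
ys = [1.0, 2.0, 3.0]
x_fit = lm_step(lambda x: [x - y for y in ys], lambda x: [1.0] * len(ys), 0.0, 0.0)
```

Larger `lam` shrinks the step toward gradient descent; `lam -> 0` recovers the full Gauss-Newton step, which is why LM can converge in far fewer iterations than first-order optimizers on well-conditioned problems.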
PyTorch native quantization and sparsity for training and inference
YoloDotNet - A C# .NET 8.0 project for Classification, Object Detection, OBB Detection, Segmentation and Pose Estimation in both images and videos.
PhantomFHE: A CUDA-Accelerated Homomorphic Encryption Library
Gradio-based tool to run open-source LLMs directly from Hugging Face
Official implementation of "Time Evidence Fusion Network: Multi-source View in Long-Term Time Series Forecasting" (https://arxiv.org/abs/2405.06419)
Run serverless GPU workloads with fast cold starts on bare-metal servers, anywhere in the world
A collection of GTSAM factors and optimizers for point cloud SLAM