Trending repositories for topic cuda
A high-throughput and memory-efficient inference and serving engine for LLMs
OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
📚150+ Tensor/CUDA Cores Kernels, ⚡️flash-attention-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS 🎉🎉).
SGLang is a fast serving framework for large language models and vision language models.
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
High-Performance Cross-Platform Monte Carlo Renderer Based on LuisaCompute
Samples for CUDA developers demonstrating features in the CUDA Toolkit
High-Performance Rendering Framework on Stream Architectures
🚀 Your YOLO deployment powerhouse. With the synergy of TensorRT Plugins, CUDA Kernels, and CUDA Graphs, experience lightning-fast inference speed.
An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
A highly optimized LLM inference acceleration engine for Llama and its variants.
SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving
Parallel, highly efficient code (CPU and GPU) for DEM and CFD-DEM simulations.
A fast communication-overlapping library for tensor parallelism on GPUs.
A throughput-oriented high-performance serving framework for LLMs
State-of-the-art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity-style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
Unbiased & physically-based GPU HIPRT (C++/HIP) interactive path tracing renderer
A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
Publish some small parts in my personal daily-used Houdini accessories
Run serverless GPU workloads with fast cold starts on bare-metal servers, anywhere in the world
Original reference implementation of the CUDA rasterizer from the paper "StopThePop: Sorted Gaussian Splatting for View-Consistent Real-time Rendering"
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA PTX and CuTe API (Write for Fun 👀~)
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
Templated C++/CUDA implementation of Model Predictive Path Integral Control (MPPI)
A Rust library integrated with ONNXRuntime, providing a collection of Computer Vision and Vision-Language models.
Numbast is a tool to build an automated pipeline that converts CUDA APIs into Numba bindings.
Best practices & guides on how to write distributed pytorch training code
Multi-platform high-performance compute language extension for Rust.
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
A model deployment white paper (CUDA | ONNX | TensorRT | C++) 🚀🚀🚀
NviWatch: A blazingly fast Rust-based TUI for managing and monitoring NVIDIA GPU processes
From-zero-to-hero CUDA for accelerating math and machine learning on the GPU.
3DGS-LM accelerates Gaussian-Splatting optimization by replacing the ADAM optimizer with Levenberg-Marquardt.
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.
PyTorch native quantization and sparsity for training and inference
YoloDotNet - A C# .NET 8.0 project for Classification, Object Detection, OBB Detection, Segmentation and Pose Estimation in both images and videos.
PhantomFHE: A CUDA-Accelerated Homomorphic Encryption Library
Gradio-based tool to run open-source LLM models directly from Hugging Face
Official implementation of "Time Evidence Fusion Network: Multi-source View in Long-Term Time Series Forecasting" (https://arxiv.org/abs/2405.06419)
A collection of GTSAM factors and optimizers for point cloud SLAM