Trending repositories for topic cuda
A high-throughput and memory-efficient inference and serving engine for LLMs
SGLang is a fast serving framework for large language models and vision language models.
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA etc.🔥
OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
Burn is a next-generation Deep Learning Framework that doesn't compromise on flexibility, efficiency, or portability.
A Python framework for accelerated simulation, data generation and spatial computing.
Samples for CUDA developers that demonstrate features of the CUDA Toolkit
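For readers new to the topic, a minimal self-contained kernel in the spirit of those samples looks like the sketch below; the kernel and variable names are illustrative, not taken from the repository.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The canonical first CUDA example: each thread adds one element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; real samples also demonstrate
    // explicit cudaMemcpy-based host/device transfers.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expected: 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```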
Run AI models locally on your machine with Node.js bindings for llama.cpp. Enforce a JSON schema on the model output at the generation level
An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
Quantized attention that achieves 2-3x and 3-5x speedups over FlashAttention and xformers respectively, without losing end-to-end metrics across language, image, and video models.
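Schemes in this family quantize the attention inputs to low-bit integers with per-row or per-block scales before the matrix multiplies. A generic sketch of that building block, symmetric per-row INT8 quantization, follows; the kernel name and layout are assumptions, not this project's actual implementation.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Symmetric per-row INT8 quantization of a [rows x cols] tile.
// Launch with one 256-thread block per row.
__global__ void quantizeRowsInt8(const float* x, int8_t* q, float* scale,
                                 int rows, int cols) {
    int row = blockIdx.x;
    if (row >= rows) return;
    const float* xr = x + (size_t)row * cols;

    // Block-wide max-abs reduction in shared memory.
    __shared__ float smax[256];
    float m = 0.0f;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        m = fmaxf(m, fabsf(xr[c]));
    smax[threadIdx.x] = m;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            smax[threadIdx.x] = fmaxf(smax[threadIdx.x], smax[threadIdx.x + stride]);
        __syncthreads();
    }

    // Scale so the row's values map into [-127, 127].
    float s = smax[0] / 127.0f + 1e-12f;
    if (threadIdx.x == 0) scale[row] = s;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        q[(size_t)row * cols + c] = (int8_t)lrintf(xr[c] / s);
}
```

The INT8 products can then run on integer tensor cores, with the per-row scales folded back in after the matmul.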
Playing around with "Less Slow" coding practices in C++20, C, CUDA, PTX, & Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking, and user-space IO
Self-host the powerful Dia TTS model. This server offers a user-friendly Web UI, flexible API endpoints (incl. OpenAI compatible), support for SafeTensors/BF16, voice cloning, dialogue generation, and...
SplatAD: Real-Time Lidar and Camera Rendering with 3D Gaussian Splatting for Autonomous Driving
A Rust library integrated with ONNXRuntime, providing a collection of Computer Vision and Vision-Language models.
A GPU-accelerated library for Tree-based Genetic Programming, leveraging PyTorch and custom CUDA kernels for high-performance evolutionary computation. It supports symbolic regression, classification,...
Unbiased & physically-based GPU HIPRT (C++/HIP) interactive path tracing renderer
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.
Run serverless GPU workloads with fast cold starts on bare-metal servers, anywhere in the world
📚FFPA (Split-D): extends FlashAttention with Split-D for large headdim, O(1) GPU SRAM complexity, 1.8x~3x↑🎉 faster than SDPA EA.
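The Split-D idea can be read as exact block matrix multiplication along the head dimension: partial dot products over headdim chunks are accumulated on the fly, so on-chip storage does not grow with headdim. Generically (a sketch of the identity, not necessarily this project's exact tiling):

```latex
S_{ij} \;=\; q_i \cdot k_j \;=\; \sum_{c=1}^{d/B} q_i^{(c)} \cdot k_j^{(c)},
```

where the head dimension d is split into chunks of size B and each chunk's partial products are accumulated in registers or shared memory.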
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
Templated C++/CUDA implementation of Model Predictive Path Integral Control (MPPI)
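MPPI samples many perturbed control sequences, simulates each rollout, and updates the nominal controls with exponentially weighted noise, w_i ∝ exp(-S_i/λ). Rollouts are independent, which is what makes the GPU mapping natural. A toy sketch for a 1D double integrator follows; the names, dynamics, and cost are assumptions, not the repository's templated API.

```cuda
#include <cuda_runtime.h>

// Toy MPPI cost evaluation: thread i simulates one noisy rollout of a 1D
// double integrator and writes its trajectory cost S_i. The host would then
// update the nominal controls with the softmax-weighted noise,
// w_i ∝ exp(-S_i / lambda).
__global__ void rolloutCosts(const float* u,    // nominal controls [T]
                             const float* eps,  // control noise [K x T]
                             float* cost,       // out: rollout costs [K]
                             int K, int T, float dt, float xGoal) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= K) return;
    float x = 0.0f, v = 0.0f, S = 0.0f;
    for (int t = 0; t < T; ++t) {
        float a = u[t] + eps[(size_t)i * T + t];  // perturbed control
        v += a * dt;                              // integrate dynamics
        x += v * dt;
        float e = x - xGoal;
        S += e * e + 0.01f * a * a;               // state + control cost
    }
    cost[i] = S;
}
```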
A collection of awesome public projects for the MAX platform, the Mojo programming language, and the Multi-Level IR Compiler Framework (MLIR).
YOLOv12 inference using C++, TensorRT, and CUDA
An efficient, user-friendly solver for nonlinear light-matter interaction
Multi-platform high-performance compute language extension for Rust.
Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.
OpenEquivariance: a fast, open-source GPU JIT kernel generator for the Clebsch-Gordan Tensor Product.
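For reference, the operation being generated is the standard coupling of two irreducible representations through Clebsch-Gordan coefficients:

```latex
(u \otimes v)^{(\ell)}_{m}
  \;=\; \sum_{m_1=-\ell_1}^{\ell_1} \sum_{m_2=-\ell_2}^{\ell_2}
  C^{\ell\, m}_{\ell_1 m_1,\, \ell_2 m_2}\;
  u^{(\ell_1)}_{m_1}\, v^{(\ell_2)}_{m_2},
  \qquad |\ell_1-\ell_2| \le \ell \le \ell_1+\ell_2,
```

the sparse contraction at the heart of equivariant neural networks.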
A real-time N-body simulation with the Barnes-Hut algorithm and CUDA
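Barnes-Hut cuts the all-pairs force computation from O(n²) to O(n log n) by approximating sufficiently distant clusters with their centers of mass. The O(n²) direct-sum baseline it accelerates is itself a classic CUDA kernel; below is a sketch with assumed names (G = 1, Plummer softening), not this repository's code.

```cuda
#include <cuda_runtime.h>

// Direct-sum gravity: thread i accumulates the acceleration of body i from
// all n bodies, with Plummer softening (soft2) to tame the r -> 0
// singularity. Barnes-Hut replaces the inner loop with a tree walk that
// lumps far-away nodes into single pseudo-particles.
__global__ void directForces(const float4* pos,  // xyz = position, w = mass
                             float3* acc, int n, float soft2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float3 a = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        float4 pj = pos[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx * dx + dy * dy + dz * dz + soft2;
        float inv = rsqrtf(r2);
        float f = pj.w * inv * inv * inv;  // m_j / r^3, with G = 1
        a.x += f * dx; a.y += f * dy; a.z += f * dz;
    }
    acc[i] = a;
}
```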
Dr.Jit — a just-in-time compiler for differentiable rendering (core library)
Neighborhood Attention Extension. Bringing attention to a neighborhood near you!
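Neighborhood attention restricts each query to a fixed local window rather than the full sequence; in one dimension, with window radius k:

```latex
\mathrm{Attn}(i)
  \;=\; \sum_{j \,:\, |i-j| \le k}
  \operatorname{softmax}_{j}\!\left(\frac{q_i \cdot k_j}{\sqrt{d}}\right) v_j ,
```

dropping the per-query cost from O(n) to O(k) while keeping the softmax local.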
SCUDA is a GPU-over-IP bridge that allows GPUs on remote machines to be attached to CPU-only machines.
Ramalama is an open-source developer tool that simplifies the local serving of AI models from any source and facilitates their use for inference in production, all through the familiar language of con...
A highly optimized LLM inference acceleration engine for Llama and its variants.
Best practices & guides on how to write distributed PyTorch training code
A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
NviWatch: a blazingly fast Rust-based TUI for managing and monitoring NVIDIA GPU processes
CUDA tutorials for maths & ML, with examples covering multi-GPU setups, fused attention, Winograd convolution, and reinforcement learning.
3DGS-LM accelerates Gaussian Splatting optimization by replacing the Adam optimizer with Levenberg-Marquardt.
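The swap is from a first-order to a second-order method: instead of Adam's rescaled gradient step, each Levenberg-Marquardt iteration solves a damped normal-equation system,

```latex
\Delta\theta \;=\; -\left(J^{\top} J + \lambda I\right)^{-1} J^{\top} r,
```

where J is the Jacobian of the residual vector r and λ interpolates between Gauss-Newton (small λ) and gradient descent (large λ).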
PyTorch-native quantization and sparsity for training and inference
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!)
A collection of GTSAM factors and optimizers for point cloud SLAM
DeepStream Libraries offer CVCUDA, NvImageCodec, and PyNvVideoCodec modules as Python APIs for seamless integration into custom frameworks.
State-of-the-art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity-style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.
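Radix sorts like these are built from repeated digit passes, and the first ingredient of each pass is a per-digit histogram, sketched below with assumed names. Roughly, OneSweep's contribution is reducing an LSD sort to a single global histogram pass plus one chained-scan partition pass per digit, which this sketch does not attempt.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// One building block of an LSD radix sort: histogram the 8-bit digit at
// `shift` across all keys. Shared-memory counters absorb most of the
// traffic before a single flush to the global histogram.
__global__ void digitHistogram(const uint32_t* keys, uint32_t* hist,
                               int n, int shift) {
    __shared__ uint32_t local[256];
    for (int d = threadIdx.x; d < 256; d += blockDim.x) local[d] = 0;
    __syncthreads();

    // Grid-stride loop over the keys.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[(keys[i] >> shift) & 0xFF], 1u);
    __syncthreads();

    for (int d = threadIdx.x; d < 256; d += blockDim.x)
        atomicAdd(&hist[d], local[d]);
}
```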