Trending repositories for topic cuda
A high-throughput and memory-efficient inference and serving engine for LLMs
SGLang is a fast serving framework for large language models and vision language models.
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
Run AI models locally on your machine with node.js bindings for llama.cpp. Enforce a JSON schema on the model output on the generation level
PyTorch native quantization and sparsity for training and inference
NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
🚀 Your YOLO deployment powerhouse. With TensorRT Plugins, CUDA Kernels, and CUDA Graphs working in concert, experience lightning-fast inference speed.
Multi-platform high-performance compute language extension for Rust.
Efficient CUDA kernels for training convolutional neural networks with PyTorch.
Minkowski Engine is an auto-diff neural network library for high-dimensional sparse tensors
A throughput-oriented high-performance serving framework for LLMs
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
Unbiased & physically-based GPU HIPRT (C++/HIP) interactive path tracing renderer
PyTorch half-precision GEMM library with fused optional bias and optional ReLU/GELU
3DGS-LM accelerates Gaussian-Splatting optimization by replacing the ADAM optimizer with Levenberg-Marquardt.
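As context for the entry above: Levenberg-Marquardt blends gradient descent with Gauss-Newton via a damping factor λ that adapts to how well each step reduces the residual. A minimal single-parameter sketch of one such solver — the problem and names here are hypothetical and illustrative only, not taken from the 3DGS-LM codebase:

```python
# Minimal Levenberg-Marquardt loop for a 1-D least-squares fit y = a * x.
# Illustrative sketch only; the Gaussian-Splatting objective and the GPU
# details of the actual 3DGS-LM implementation are omitted.

def lm_fit(xs, ys, a=0.0, lam=1e-3, iters=50):
    for _ in range(iters):
        # Residuals r_i = a*x_i - y_i; the Jacobian of r_i w.r.t. a is x_i.
        r = [a * x - y for x, y in zip(xs, ys)]
        jtj = sum(x * x for x in xs)                # J^T J (a scalar here)
        jtr = sum(x * ri for x, ri in zip(xs, r))   # J^T r
        # Damped normal equations: (J^T J + lam) * delta = -J^T r.
        delta = -jtr / (jtj + lam)
        new_a = a + delta
        new_r = [new_a * x - y for x, y in zip(xs, ys)]
        if sum(ri * ri for ri in new_r) < sum(ri * ri for ri in r):
            a, lam = new_a, lam * 0.5   # step reduced the error: accept, trust the model more
        else:
            lam *= 2.0                  # step failed: reject, lean toward gradient descent
    return a

print(round(lm_fit([1, 2, 3], [2, 4, 6]), 6))  # → 2.0
```

The appeal over ADAM for problems like this is the second-order curvature information in `J^T J`, which gives near-quadratic convergence close to the optimum.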
Cross-architecture parallel algorithms for Julia's GPU backends, from a unified KernelAbstractions.jl codebase. Targets Intel oneAPI, AMD ROCm, Apple Metal, Nvidia CUDA.
A fast communication-overlapping library for tensor parallelism on GPUs.
🔥🔥🔥 A curated collection of awesome public CUDA, cuBLAS, TensorRT, and High-Performance Computing (HPC) projects.
YoloDotNet - A C# .NET 8.0 project for Classification, Object Detection, OBB Detection, Segmentation and Pose Estimation in both images and videos.
A VR viewer for Gaussian-splatting models, developed as a native Unity plugin with the original CUDA rasterizer.
Best practices & guides on how to write distributed PyTorch training code
A retargetable MLIR-based machine learning compiler and runtime toolkit.
🎉 Modern CUDA Learn Notes with PyTorch: CUDA Cores, Tensor Cores, fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, hgemm, sgemv, warp/block reduce, elementwise, softmax, layernorm, rmsnorm.
An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
Templated C++/CUDA implementation of Model Predictive Path Integral Control (MPPI)
A collection of GTSAM factors and optimizers for point cloud SLAM
A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more!
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
A Rust library integrated with ONNXRuntime, providing a collection of Computer Vision and Vision-Language models.
DeepStream Libraries offer CVCUDA, NvImageCodec, and PyNvVideoCodec modules as Python APIs for seamless integration into custom frameworks.
Speed up image preprocessing with CUDA when handling images or running TensorRT inference
Run serverless workloads with fast cold starts on bare-metal servers, anywhere in the world
A model-deployment white paper (CUDA | ONNX | TensorRT | C++) 🚀🚀🚀
NviWatch: a blazingly fast Rust-based TUI for managing and monitoring NVIDIA GPU processes
From zero to hero: CUDA for accelerating maths and machine learning on the GPU.
State of the art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.
OneDiff: An out-of-the-box acceleration library for diffusion models.
A high-performance inference system for large language models, designed for production environments.
模型部署白皮书(CUDA|ONNX|TensorRT|C++)🚀🚀🚀
NviWatch: A blazingly fast rust based TUI for managing and monitoring NVIDIA GPU processes
Gradio-based tool to run open-source LLMs directly from Hugging Face
Official implementation of "Time Evidence Fusion Network: Multi-source View in Long-Term Time Series Forecasting" (https://arxiv.org/abs/2405.06419)
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
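For reference, every HGEMV variant in an entry like the one above — thread-per-row, warp-per-row, vectorized loads — computes the same reduction. A pure-Python sketch of that baseline (the half-precision and warp-level details of actual CUDA kernels are omitted):

```python
# Reference GEMV: y = A @ x, the computation that optimized HGEMV
# kernels accelerate. Sequential sketch for clarity only.

def gemv(A, x):
    # Each output element is the dot product of one matrix row with x;
    # on a GPU this is typically one thread or one warp per row.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1, 2], [3, 4]]
x = [10, 1]
print(gemv(A, x))  # → [12, 34]
```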
zkDL, an open source toolkit for zero-knowledge proofs of deep learning powered by CUDA
A nearly complete collection of prefix sum algorithms implemented in CUDA, D3D12, Unity and WGPU. Theoretically portable to all wave/warp/subgroup sizes.
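The prefix-sum entry that closes the list above refers to scan kernels whose classic pattern is the work-efficient Blelloch exclusive scan: an up-sweep that builds a tree of partial sums, then a down-sweep that pushes them back down. A sequential Python sketch of that pattern, assuming for simplicity a power-of-two input length (on a GPU, each inner loop becomes one parallel step):

```python
# Work-efficient (Blelloch) exclusive prefix sum, written sequentially.
# GPU scan kernels run each stride level's inner loop in parallel.

def exclusive_scan(data):
    n = len(data)               # assumed a power of two for simplicity
    a = list(data)
    # Up-sweep (reduce): accumulate partial sums up a binary tree.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            a[i] += a[i - stride]
        stride *= 2
    # Down-sweep: clear the root, then at each level the left child
    # takes the parent's value and the right child adds the old left.
    a[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            a[i - stride], a[i] = a[i], a[i] + a[i - stride]
        stride //= 2
    return a

print(exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # → [0, 3, 4, 11, 11, 15, 16, 22]
```

Production GPU scans (e.g. OneSweep-style single-pass variants) layer decoupled lookback and warp-level primitives on top of this same recurrence.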