Trending repositories for topic cuda
A high-throughput and memory-efficient inference and serving engine for LLMs
SGLang is a fast serving framework for large language models and vision language models.
📚Modern CUDA Learn Notes with PyTorch: Tensor/CUDA Cores, 📖150+ CUDA Kernels with PyTorch bindings, 📖HGEMM/SGEMM (95%~99% cuBLAS performance), 📖100+ LLM/CUDA Blogs.
Samples for CUDA developers demonstrating features in the CUDA Toolkit
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
🚀🚀🚀 A collection of some awesome public YOLO object detection series projects.
OneDiff: An out-of-the-box acceleration library for diffusion models.
An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management.
Unbiased & physically-based GPU HIPRT (C++/HIP) interactive path tracing renderer
A nearly complete collection of prefix sum algorithms implemented in CUDA, D3D12, Unity and WGPU. Theoretically portable to all wave/warp/subgroup sizes.
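For context on what these scan libraries compute: an inclusive prefix sum replaces each element with the running total of everything up to and including it. A minimal sequential Python sketch of the operation (illustrative only, not code from the listed repository, which implements parallel GPU variants):

```python
def inclusive_scan(xs):
    # Inclusive prefix sum: out[i] = xs[0] + xs[1] + ... + xs[i]
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

# inclusive_scan([3, 1, 7, 0, 4]) -> [3, 4, 11, 11, 15]
```

The GPU versions in the repository compute the same result, but in parallel across warps/waves/subgroups rather than with a serial loop.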
An efficient, user-friendly solver for nonlinear light-matter interaction
Run serverless GPU workloads with fast cold starts on bare-metal servers, anywhere in the world
fdtd3d is an open source 1D, 2D, 3D FDTD electromagnetics solver with MPI, OpenMP and CUDA support for x64, ARM, ARM64, RISC-V, PowerPC, Wasm architectures
Latest hashcat docker for CUDA, OpenCL, and POCL. Deployed on Vast.ai
A project that gives you a way to use SOLIDWORKS on Linux!
CUDA implementation of Hierarchical Navigable Small World Graph algorithm
A Docker image for Ubuntu Desktop that supports hardware GPU-accelerated GUI apps. You can access the container via SSH or remote desktop, just like a cloud VM.
AUTOMATIC1111/stable-diffusion-webui for CUDA and ROCm on NixOS
PyTorch native quantization and sparsity for training and inference
Numbast is a tool to build an automated pipeline that converts CUDA APIs into Numba bindings.
High-speed GEMV kernels with up to 2.7x speedup over the PyTorch baseline.
The PennyLane-Lightning plugin provides a fast state-vector simulator written in C++ for use with PennyLane
Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.
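For reference, HGEMV is the general matrix-vector product y = A·x carried out in half precision. The underlying operation being optimized can be sketched in plain Python (an illustration of the math only, not code from the repository):

```python
def gemv(A, x):
    # General matrix-vector multiply: y[i] = sum_j A[i][j] * x[j]
    # CUDA implementations parallelize this per-row dot product
    # across threads, typically with warp-level reductions.
    return [sum(a * b for a, b in zip(row, x)) for row in A]

# gemv([[1, 2], [3, 4]], [1, 1]) -> [3, 7]
```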
Speed up image preprocessing with CUDA when handling images or running TensorRT inference
A fast communication-overlapping library for tensor parallelism on GPUs.
Efficient CUDA kernels for training convolutional neural networks with PyTorch.
Multi-platform high-performance compute language extension for Rust.
Best practices & guides on how to write distributed pytorch training code
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
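For context, flash attention computes standard scaled dot-product attention, softmax(q·Kᵀ/√d)·V, while avoiding materializing the full score matrix in GPU memory. A minimal single-query Python sketch of the mathematical operation (illustrative only, not the repository's implementation):

```python
import math

def attention(q, K, V):
    # Scaled dot-product attention for one query vector q:
    #   softmax(q . K^T / sqrt(d)) . V
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)                          # subtract max for numerical stability
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    probs = [wi / z for wi in w]
    return [sum(p * v[j] for p, v in zip(probs, V)) for j in range(len(V[0]))]

# With two identical keys the weights are 0.5 each:
# attention([1, 0], [[1, 0], [1, 0]], [[1, 2], [3, 4]]) -> [2.0, 3.0]
```

Flash attention fuses these steps into tiled kernels so the scores and probabilities never hit global memory, which is where the inference-time performance differences measured above come from.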
A PyTorch half-precision GEMM library with fused optional bias plus optional ReLU/GELU
A tiny yet powerful LLM inference system tailored for researching purpose. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
A Rust library integrated with ONNXRuntime, providing a collection of Computer Vision and Vision-Language models.
The nvImageCodec library of GPU- and CPU-accelerated codecs featuring a unified interface
A collection of GTSAM factors and optimizers for point cloud SLAM
🚀 Your YOLO deployment powerhouse. With the synergy of TensorRT Plugins, CUDA Kernels, and CUDA Graphs, experience lightning-fast inference speeds.
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
Model deployment whitepaper (CUDA | ONNX | TensorRT | C++) 🚀🚀🚀
NviWatch: a blazingly fast Rust-based TUI for managing and monitoring NVIDIA GPU processes
From zero to hero: CUDA for accelerating math and machine learning on the GPU.
3DGS-LM accelerates Gaussian-Splatting optimization by replacing the ADAM optimizer with Levenberg-Marquardt.
State of the art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.
A CUDA reimplementation of the line/plane odometry of LIO-SAM. A point cloud hash map (inspired by iVox of Faster-LIO) on GPU is used to accelerate 5-neighbour KNN search.
A Fully Homomorphic Encryption (FHE) library for bridging the gap between theory and practice with a focus on performance and accuracy.
A Gradio-based tool to run open-source LLM models directly from Hugging Face
Official implementation of "Time Evidence Fusion Network: Multi-source View in Long-Term Time Series Forecasting" (https://arxiv.org/abs/2405.06419)
zkDL: an open-source toolkit for zero-knowledge proofs of deep learning, powered by CUDA