Statistics for topic cuda
RepositoryStats tracks 628,170 GitHub repositories; 682 of these are tagged with the cuda topic. The most common primary language for repositories using this topic is C++ (249). Other primary languages include Python (153), Cuda (81), C (34), Jupyter Notebook (26), Rust (19), Dockerfile (14), and Shell (13).
Stargazers over time for topic cuda
Most starred repositories for topic cuda
A high-throughput and memory-efficient inference and serving engine for LLMs (see the usage sketch after this list)
SGLang is a fast serving framework for large language models and vision language models.
OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
[CVPR 2025] EnvGS: Modeling View-Dependent Appearance with Environment Gaussian. Including a fully differentiable 2D Gaussian ray tracer built on 2DGS and OptiX, supporting multiple-bounce path tracin...
📚FFPA(Split-D): Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑🎉vs SDPA EA.
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
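The first description above matches the vLLM project. Assuming that is the repository meant, here is a minimal Python sketch of the offline-inference API from vLLM's documented quickstart; the model name is only a placeholder:

    # Minimal offline-inference sketch following vLLM's quickstart.
    # Assumes the item above refers to vLLM (pip install vllm), a
    # CUDA-capable GPU, and a placeholder model name.
    from vllm import LLM, SamplingParams

    prompts = ["Hello, my name is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    llm = LLM(model="facebook/opt-125m")              # load the model
    outputs = llm.generate(prompts, sampling_params)  # batched generation

    for output in outputs:
        print(output.prompt, output.outputs[0].text)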
Trending repositories for topic cuda
A high-throughput and memory-efficient inference and serving engine for LLMs
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
SGLang is a fast serving framework for large language models and vision language models.
OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
YOLOv12 inference using C++, TensorRT, and CUDA
📚FFPA(Split-D): Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑🎉vs SDPA EA.
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
A GPU-accelerated library for Tree-based Genetic Programming, leveraging PyTorch and custom CUDA kernels for high-performance evolutionary computation. It supports symbolic regression, classification,...
SCUDA is a GPU over IP bridge allowing GPUs on remote machines to be attached to CPU-only machines.
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
Multi-platform high-performance compute language extension for Rust.
A highly optimized LLM inference acceleration engine for Llama and its variants.
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton (see the kernel sketch after this list).
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
A model deployment whitepaper (CUDA | ONNX | TensorRT | C++) 🚀🚀🚀
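One item above describes PyTorch neural-network modules written with OpenAI's Triton. As an illustration of the programming model involved (not code from that repository), here is the standard Triton vector-add pattern from Triton's own introductory tutorial:

    # Standard Triton vector-add, shown only to illustrate the programming
    # model; this is tutorial code, not taken from the listed repository.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                     # one program per block
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                     # guard the tail block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.rand(4096, device="cuda")
    y = torch.rand(4096, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)              # number of programs
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)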