Trending repositories for the CUDA language
From zero to hero: CUDA for accelerating maths and machine learning on the GPU.
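As a rough illustration of the starting point a zero-to-hero CUDA tutorial usually opens with (a generic sketch, not code from this repository), a minimal vector-add program looks like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The canonical first CUDA kernel: each thread adds one pair of elements.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```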
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
🎉 CUDA notes / hand-written CUDA kernels for large models / C++ notes, updated whenever: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS.
A comprehensive library of 3D transmission Computed Tomography (CT) algorithms with a Python API, fully integrated with PyTorch.
A differentiable rasterizer used in the project "2D Gaussian Splatting"
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.
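For readers unfamiliar with the WMMA API mentioned above, here is a minimal, hedged sketch (assuming a single 16x16x16 tile on sm_70+ hardware; not this repository's optimized kernels) of how one warp drives the Tensor Cores:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of C = A * B with Tensor Cores.
// A and C are row-major, B is column-major, M = N = K = 16. Launch with
// exactly one warp, e.g. wmma_hgemm_16x16x16<<<1, 32>>>(A, B, C);
// compile for sm_70 or newer.
__global__ void wmma_hgemm_16x16x16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension of A
    wmma::load_matrix_sync(b_frag, B, 16);   // leading dimension of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```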
Tutorials for writing high-performance GPU operators in AI frameworks.
Neighborhood Attention Extension. Bringing attention to a neighborhood near you!
Notes on "Programming Massively Parallel Processors" by Hwu, Kirk, and Hajj (4th ed.)
A massively parallel, optimal functional runtime in Rust
The simplest, yet still fast, implementation of matrix multiplication in CUDA.
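For context, the naive baseline such a repository starts from usually looks like the sketch below (one thread per output element), with tiled and shared-memory variants benchmarked against it. This is a generic example, not the repository's code:

```cuda
// Naive SGEMM: one thread computes one element of C = A * B.
// A is M x K, B is K x N, C is M x N, all row-major.
__global__ void sgemm_naive(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
// Example launch: dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
// sgemm_naive<<<grid, block>>>(A, B, C, M, N, K);
```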
OneSweep, implemented in CUDA, D3D12, and Unity-style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.
LightwheelOcc: A 3D Occupancy Synthetic Dataset in Autonomous Driving
Flash Attention in ~100 lines of CUDA (forward pass only)
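The trick that lets Flash Attention fit in so few lines is online softmax: scores are consumed in a streaming fashion with a running max and a rescaled denominator, so the full attention matrix is never materialized. A simplified sketch of that accumulation (one scalar value per position for brevity; not this repository's kernel) is:

```cuda
#include <math.h>

// Streaming softmax(scores) · v for one attention row.
// m: running max, l: running denominator, acc: running rescaled output.
__device__ void online_softmax_row(const float* scores, const float* v,
                                   int n, float* out) {
    float m = -INFINITY, l = 0.0f, acc = 0.0f;
    for (int j = 0; j < n; ++j) {
        float m_new = fmaxf(m, scores[j]);
        float scale = __expf(m - m_new);        // rescale old statistics
        float p     = __expf(scores[j] - m_new);
        l   = l * scale + p;
        acc = acc * scale + p * v[j];
        m   = m_new;
    }
    *out = acc / l;  // equals dot(softmax(scores), v)
}
```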
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
An implementation of the transformer architecture in NVIDIA CUDA kernels
Multithreaded matrix multiplication and analysis based on OpenMP and Pthreads
CUDA-accelerated rasterization for Gaussian splatting
This is a series of GPU optimization topics covering how to optimize CUDA kernels in detail, starting with several basic kernel optimizations, including: elementwise, reduce, sg...
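A representative building block from such a series is the warp-level reduction, shown here as a brief, generic sketch (not taken from the repository): the 32 lanes of a warp fold their values together using shuffles, with no shared memory or __syncthreads().

```cuda
// Sum across the 32 lanes of a warp; after the loop, lane 0 holds the total.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}
```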
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing hig...
[MICRO'23, MLSys'22] TorchSparse: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
Causal depthwise conv1d in CUDA, with a PyTorch interface
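As a hedged sketch of what a causal depthwise conv1d kernel can look like (a generic illustration, not this repository's implementation): each channel has its own filter of width W, and output position t only reads inputs at positions <= t.

```cuda
// x and y are [C, L] row-major, w is [C, W]; one thread per (channel, time).
// Example launch: dim3 grid((L + 255) / 256, C);
// causal_depthwise_conv1d<<<grid, 256>>>(x, w, y, C, L, W);
__global__ void causal_depthwise_conv1d(const float* x, const float* w,
                                        float* y, int C, int L, int W) {
    int c = blockIdx.y;
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= C || t >= L) return;
    float acc = 0.0f;
    for (int k = 0; k < W; ++k) {
        int src = t - k;                     // only look at the past
        if (src >= 0) acc += w[c * W + k] * x[c * L + src];
    }
    y[c * L + t] = acc;
}
```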
The deployment of YOLOv8-seg on the Jetson AGX Xavier (a YOLOv8 detection and segmentation model with low-light compensation)