Trending repositories for language CUDA
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
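The core mechanic behind such quantized-attention schemes is cheap on-GPU low-bit quantization of the attention tiles before the matmul. Below is a minimal sketch of per-row symmetric INT8 quantization, assuming FP16 input, one 256-thread block per row, and illustrative names throughout (none taken from the repository):

    // Per-row symmetric INT8 quantization: scale = max|x| / 127.
    // Illustrative sketch; launch with one 256-thread block per row.
    #include <cuda_fp16.h>
    #include <stdint.h>

    __global__ void quantize_rows_int8(const half* __restrict__ x,
                                       int8_t* __restrict__ q,
                                       float* __restrict__ scale,
                                       int cols) {
        int row = blockIdx.x;
        __shared__ float smax[256];
        // 1) block-wide reduction of max|x| over the row
        float m = 0.f;
        for (int c = threadIdx.x; c < cols; c += blockDim.x)
            m = fmaxf(m, fabsf(__half2float(x[row * cols + c])));
        smax[threadIdx.x] = m;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                smax[threadIdx.x] = fmaxf(smax[threadIdx.x], smax[threadIdx.x + s]);
            __syncthreads();
        }
        float sc = smax[0] / 127.f + 1e-8f;   // avoid divide-by-zero on all-zero rows
        if (threadIdx.x == 0) scale[row] = sc;
        // 2) quantize: round-to-nearest into int8
        for (int c = threadIdx.x; c < cols; c += blockDim.x)
            q[row * cols + c] = (int8_t)rintf(__half2float(x[row * cols + c]) / sc);
    }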
[ICLR2025] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
Flash Attention in ~100 lines of CUDA (forward pass only)
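What fits in ~100 lines is essentially the online-softmax recurrence: stream over the keys once, keep a running max m and normalizer l, and rescale the partial output whenever the max moves. A toy sketch of that recurrence, assuming one thread per query row and a fixed head dimension of 64 (layout and names are illustrative, not the repository's kernel):

    // Online softmax in one pass over the keys: O(1) extra memory per query.
    // Toy layout: one thread per query row; HEAD_DIM fixed at 64 (an assumption).
    #include <math.h>
    #define HEAD_DIM 64

    __global__ void attention_forward_naive(const float* Q, const float* K,
                                            const float* V, float* O,
                                            int n, float softmax_scale) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // query row
        if (i >= n) return;
        float m = -INFINITY, l = 0.f, acc[HEAD_DIM] = {0.f};
        for (int j = 0; j < n; ++j) {                    // stream over keys once
            float s = 0.f;
            for (int d = 0; d < HEAD_DIM; ++d)
                s += Q[i * HEAD_DIM + d] * K[j * HEAD_DIM + d];
            s *= softmax_scale;
            float m_new = fmaxf(m, s);
            float corr = expf(m - m_new);                // rescale older contributions
            float p = expf(s - m_new);
            l = l * corr + p;                            // running normalizer
            for (int d = 0; d < HEAD_DIM; ++d)
                acc[d] = acc[d] * corr + p * V[j * HEAD_DIM + d];
            m = m_new;
        }
        for (int d = 0; d < HEAD_DIM; ++d)
            O[i * HEAD_DIM + d] = acc[d] / l;            // final normalization
    }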
RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-accelerated and form building blocks for more easily writing high-performance applications.
HEonGPU is a high-performance library that optimizes Fully Homomorphic Encryption (FHE) on GPUs. Leveraging GPU parallelism, it reduces computational load through concurrent execution. Its multi-stream architecture allows independent operations to run in parallel.
TileFusion is a highly efficient kernel template library designed to elevate the level of abstraction in CUDA C for processing tiles.
A comparison of array languages & libraries: APL, J, BQN, Uiua, Q, Julia, R, NumPy, Nial, Futhark, Dex, Ivy, SaC & ArrayFire.
Training materials associated with NVIDIA's CUDA Training Series (www.olcf.ornl.gov/cuda-training-series/)
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instructions.
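The WMMA path boils down to three intrinsics per 16x16x16 tile: load the fragments, mma_sync on the tensor cores, store the accumulator. A minimal sketch assuming row-major FP16 inputs, an FP32 accumulator, and M, N, K all multiples of 16 (requires sm_70+; not the repository's tuned kernels):

    // One warp computes one 16x16x16 tile of C = A*B on tensor cores.
    // Assumes row-major half A (MxK) and B (KxN), float C, M/N/K multiples of 16.
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    __global__ void hgemm_wmma(const half* A, const half* B, float* C,
                               int M, int N, int K) {
        int tileM = blockIdx.y * 16;
        int tileN = blockIdx.x * 16;
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
        wmma::fill_fragment(c, 0.0f);
        for (int k = 0; k < K; k += 16) {
            wmma::load_matrix_sync(a, A + tileM * K + k, K);  // lda = K
            wmma::load_matrix_sync(b, B + k * N + tileN, N);  // ldb = N
            wmma::mma_sync(c, a, b, c);                       // tensor-core FMA
        }
        wmma::store_matrix_sync(C + tileM * N + tileN, c, N, wmma::mem_row_major);
    }
    // Launch one warp per tile: hgemm_wmma<<<dim3(N/16, M/16), 32>>>(A, B, C, M, N, K);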
📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️GPU SRAM complexity for headdim > 256, 1.8x~3x↑🎉faster vs SDPA EA.
Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks", containing the code for the paper's experiments.
A throughput-oriented high-performance serving framework for LLMs
MD5 hash cracking with CUDA and Rust, implemented from scratch
Inference Speed Benchmark for Learning to (Learn at Test Time): RNNs with Expressive Hidden States
State of the art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.
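Chained-scan radix sorts such as OneSweep lean on fast warp-level prefix sums for digit counting. A small sketch of the shuffle-based inclusive scan such sorts build on (a generic building block, not the repository's code):

    // Warp-wide inclusive prefix sum via shuffles; all 32 lanes must be active.
    __device__ unsigned warp_inclusive_scan(unsigned v) {
        unsigned lane = threadIdx.x & 31;
        for (int offset = 1; offset < 32; offset <<= 1) {
            unsigned up = __shfl_up_sync(0xffffffffu, v, offset);
            if (lane >= offset) v += up;   // add the value held 'offset' lanes below
        }
        return v;
    }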
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
From zero to hero: CUDA for accelerating maths and machine learning on the GPU.
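The canonical first step in any such from-zero course is a vector add: allocate, launch a grid-stride kernel, synchronize, check the result. A minimal sketch using unified memory to keep the host code short (names are illustrative):

    // "Hello GPU": grid-stride vector add with the standard allocate/launch flow.
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void vec_add(const float* a, const float* b, float* c, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));  // unified memory keeps the demo short
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.f; b[i] = 2.f; }
        vec_add<<<256, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();
        printf("c[0] = %f\n", c[0]);               // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
    }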
[ECCV'24] On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy
The modified differential Gaussian rasterization in the CVPR 2024 highlight paper: GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting.
A massively parallel, optimal functional runtime in Rust
llama3.cuda is a pure C/CUDA implementation of the Llama 3 model.
Differential Gaussian Rasterization with Depth forward and backward functionality
PhantomFHE: A CUDA-Accelerated Homomorphic Encryption Library