6 results found
Primary language: Python (4), C++ (1), Cuda (1)
- +
📚A curated list of Awesome LLM/VLM Inference papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc.
Created 2023-08-27
471 commits to main branch, last one 4 days ago
TransMLA: Multi-Head Latent Attention Is All You Need
Created 2025-01-02
15 commits to main branch, last one about a month ago
Light-field imaging application for plenoptic cameras
Created 2019-03-30
1,555 commits to master branch, last one about a year ago
📚FFPA (Split-D): Yet another Faster Flash Attention with O(1) GPU SRAM complexity for large headdim, 1.8x~3x↑🎉 faster than SDPA EA.
Created 2024-11-29
247 commits to main branch, last one 28 days ago
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
Created 2024-07-02
42 commits to master branch, last one 2 months ago
Decoding Attention is optimized specifically for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.
Created 2024-08-14
2 commits to master branch, last one 20 days ago