6 results found

📚 A curated list of awesome LLM/VLM inference papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc.
Created 2023-08-27
471 commits to main branch, last one 4 days ago
License: MIT

TransMLA: Multi-Head Latent Attention Is All You Need
Created 2025-01-02
15 commits to main branch, last one about a month ago
License: GPL-3.0

Light-field imaging application for plenoptic cameras
Created 2019-03-30
1,555 commits to master branch, last one about a year ago

📚 FFPA (Split-D): Yet another faster FlashAttention with O(1) GPU SRAM complexity for large headdim, 1.8x~3x faster than SDPA EA 🎉.
Created 2024-11-29
247 commits to main branch, last one 28 days ago

[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
Created 2024-07-02
42 commits to master branch, last one 2 months ago

Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference.
Created 2024-08-14
2 commits to master branch, last one 20 days ago