6 results found
Primary language: Python (4), C++ (1), Cuda (1)
- +
📚A curated list of Awesome LLM/VLM Inference papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc.
Created 2023-08-27
471 commits to main branch, last one 4 days ago
TransMLA: Multi-Head Latent Attention Is All You Need
Created 2025-01-02
15 commits to main branch, last one about a month ago
Light-field imaging application for plenoptic cameras
Created 2019-03-30
1,555 commits to master branch, last one about a year ago
📚FFPA (Split-D): Yet another Faster Flash Attention with O(1) GPU SRAM complexity for large headdim, 1.8x~3x↑🎉 faster than SDPA EA.
Created 2024-11-29
247 commits to main branch, last one 28 days ago
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
Created 2024-07-02
42 commits to master branch, last one 2 months ago
Decoding Attention is optimized specifically for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.
Created 2024-08-14
2 commits to master branch, last one 20 days ago