
Decoding Attention is specifically optimized for multi-head attention (MHA) using CUDA Cores for the decoding stage of LLM inference.
Created 2024-08-14
1 commit to master branch, last one about a month ago
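For context, decoding-stage attention reduces to a single query row per head attending over the cached keys and values, which is why it is memory-bound and can run on CUDA Cores rather than Tensor Cores. Below is a minimal NumPy sketch of the computation this kernel accelerates; it is illustrative only, not the repository's CUDA implementation, and the shapes and function name are assumptions.

```python
import numpy as np

def decode_step_attention(q, k_cache, v_cache):
    """Single-token (decoding-stage) multi-head attention sketch.

    Hypothetical shapes, not the repo's API:
      q:       (num_heads, head_dim)          query for the one new token
      k_cache: (num_heads, seq_len, head_dim) cached keys
      v_cache: (num_heads, seq_len, head_dim) cached values
    Returns:   (num_heads, head_dim)
    """
    head_dim = q.shape[-1]
    # One row of attention scores per head: (num_heads, seq_len)
    scores = np.einsum("hd,hsd->hs", q, k_cache) / np.sqrt(head_dim)
    # Numerically stable softmax over the cached sequence positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of cached values: (num_heads, head_dim)
    return np.einsum("hs,hsd->hd", weights, v_cache)
```

Because the query is a single row, there is no large matrix-matrix product to feed Tensor Cores; the work is dominated by streaming the KV cache, which matches the repository's stated focus on CUDA-Core decoding kernels.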