5 results found

📖 A curated list of awesome LLM inference papers with code, covering TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
Created 2023-08-27
405 commits to main branch, last one 7 days ago
152 forks · 1.4k stars · GPL-3.0 license · 13 open issues
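PagedAttention and continuous batching, two of the techniques this list covers, are what engines like vLLM implement under the hood. A minimal usage sketch, where the model id and sampling settings are illustrative assumptions rather than anything from the list:

```python
# Minimal vLLM sketch. The engine applies PagedAttention (block-wise KV
# cache management) and continuous batching internally; nothing extra to configure.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model id
params = SamplingParams(temperature=0.8, max_tokens=64)

# Both prompts are scheduled together, and new requests can join the
# running batch between decode steps (continuous batching).
outputs = llm.generate(["What is FlashAttention?", "Explain KV caching."], params)
for out in outputs:
    print(out.outputs[0].text)
```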
🎉 Modern CUDA learning notes with PyTorch: CUDA cores, Tensor Cores, fp32/tf32, fp16/bf16, fp8/int8, flash_attn, rope, sgemm, hgemm, sgemv, warp/block reduce, elementwise, softmax, layernorm, rmsnorm.
Created 2022-12-17
299 commits to main branch, last one 2 days ago
31 forks · 187 stars · MIT license · 4 open issues
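Several of the ops those notes cover (softmax, layernorm, rmsnorm) have short PyTorch reference implementations that a hand-written CUDA kernel can be checked against. A sketch for RMSNorm, assuming the usual definition with a learned per-channel gain:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Scale by the reciprocal root-mean-square over the last dim, then apply
    # a per-channel gain; unlike LayerNorm, no mean is subtracted.
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rms * weight

x = torch.randn(4, 1024)
print(rms_norm(x, torch.ones(1024)).shape)  # torch.Size([4, 1024])
```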
Shush is an app that deploys a Whisper v3 model with FlashAttention-2 on Modal and serves requests to it via a Next.js app.
Created 2023-11-18
64 commits to main branch, last one 5 months ago
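The listing does not show how Shush loads its model, but with Hugging Face transformers, enabling FlashAttention-2 for a Whisper v3 checkpoint typically looks like the sketch below. The checkpoint id and dtype are assumptions, and the flash-attn package plus a GPU are required:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "openai/whisper-large-v3"  # assumed checkpoint, not confirmed by the repo
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)
```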
A Triton implementation of FlashAttention-2 that adds custom masks.
Created 2024-07-20
18 commits to main branch, last one 2 months ago
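For contrast with that fused Triton kernel, stock PyTorch expresses the same custom-mask semantics by materializing the mask and passing it to scaled_dot_product_attention; the point of fusing the mask into the kernel is to avoid building the S×S tensor at all. A sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

B, H, S, D = 2, 8, 128, 64
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

# Arbitrary boolean mask: True = may attend, False = blocked. Keeping ~90%
# of entries True so every query row has at least one valid key.
mask = torch.rand(S, S) > 0.1

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```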
Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Created 2023-08-16
1 commit to master branch, last one 2 months ago
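Benchmarks of this kind typically wrap the attention call in device-side event timing with warmup iterations. A minimal PyTorch analogue, where the shapes, dtype, and iteration counts are placeholders rather than the repo's settings:

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

for _ in range(10):  # warmup so clocks and kernel selection settle
    F.scaled_dot_product_attention(q, k, v, is_causal=True)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
for _ in range(100):
    F.scaled_dot_product_attention(q, k, v, is_causal=True)
end.record()
torch.cuda.synchronize()  # elapsed_time is only valid after both events complete
print(f"{start.elapsed_time(end) / 100:.3f} ms per call")
```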