6 results found

97
1.4k
apache-2.0
25
Quantized Attention achieves speedups of 2-3x and 3-5x over FlashAttention and xformers, respectively, without losing end-to-end metrics across language, image, and video models.
Created 2024-10-03
87 commits to main branch, last one 4 days ago
27
726
apache-2.0
8
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
Created 2024-11-27
104 commits to main branch, last one 7 days ago
30
491
apache-2.0
6
SpargeAttention: A training-free sparse attention method that can accelerate inference for any model.
Created 2025-02-25
46 commits to main branch, last one 3 days ago
12
197
apache-2.0
3
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Created 2024-05-31
64 commits to main branch, last one 2 months ago
12
142
apache-2.0
1
⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud or AI HW.
Created 2023-04-16
358 commits to main branch, last one 10 months ago
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
Created 2024-06-11
8 commits to master branch, last one 9 months ago