4 results found (Primary Language: Python 3, Cuda 1)
Quantized attention that achieves speedups of 2.1-3.1x and 2.7-5.1x over FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
Created 2024-10-03
64 commits to main branch, last one 22 hours ago
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Created 2024-05-31
63 commits to main branch, last one 2 months ago
⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud, or on AI hardware.
Created 2023-04-16
358 commits to main branch, last one 6 months ago
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
Created 2024-06-11
8 commits to master branch, last one 5 months ago