4 results found (Primary Language: Python 3, Cuda 1)
Quantized attention that achieves speedups of 2.1-3.1x and 2.7-5.1x over FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
Created 2024-10-03
64 commits to main branch, last one 22 hours ago
[NeurIPS 2024] AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Created 2024-05-31
63 commits to main branch, last one 2 months ago
⚡️ A fast and flexible PyTorch inference server that runs locally, on any cloud, or on AI hardware.
Created 2023-04-16
358 commits to main branch, last one 6 months ago
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
Created 2024-06-11
8 commits to master branch, last one 5 months ago