13 results found
Primary language: Python (8), C++ (2), Cuda (2)
The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
Created
2023-08-03
501 commits to main branch, last one 17 days ago
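A minimal sketch of loading a Qwen chat checkpoint with Hugging Face `transformers`. The checkpoint id `Qwen/Qwen-7B-Chat` and the `trust_remote_code` flag are assumptions based on common Qwen usage, not taken from this listing.

```python
# Hypothetical example: load an assumed Qwen chat checkpoint and generate a reply.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

inputs = tokenizer("What is flash attention?", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```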
Phase-2 project for the Chinese LLaMA-2 & Alpaca-2 large models, plus 64K ultra-long-context models (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models).
Created
2023-07-18
264 commits to main branch, last one 4 months ago
Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).
Created
2023-07-06
245 commits to main branch, last one 11 days ago
📖 A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, Flash-Attention, Paged-Attention, Parallelism, etc. 🎉🎉
Created
2023-08-27
450 commits to main branch, last one 5 days ago
FlashInfer: Kernel Library for LLM Serving
Created
2023-07-22
971 commits to main branch, last one 11 hours ago
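A minimal sketch of single-request decode attention with FlashInfer's Python API, assuming the documented `single_decode_with_kv_cache` entry point and its default NHD layout (kv_len, num_kv_heads, head_dim); the shapes are illustrative.

```python
# Hedged sketch: one new query token attends over a cached KV sequence via FlashInfer.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 32, 128, 4096

q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Decode-stage attention of the single query token against the KV cache.
o = flashinfer.single_decode_with_kv_cache(q, k, v)  # [num_qo_heads, head_dim]
```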
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
Created
2024-01-16
501 commits to develop branch, last one 4 days ago
The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme.
Created
2024-10-16
30 commits to main branch, last one about a month ago
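A conceptual sketch of the memory-tiling idea behind such schemes: compute the contrastive (InfoNCE) loss over row chunks of the B x B similarity matrix instead of materializing it all at once. This is only an illustration; Inf-CL's actual kernel-level tiling and numerics are more involved.

```python
# Chunked CLIP-style contrastive loss: peak logits memory is chunk x B, not B x B.
import torch
import torch.nn.functional as F

def chunked_clip_loss(img: torch.Tensor, txt: torch.Tensor,
                      temperature: float = 0.07, chunk: int = 1024) -> torch.Tensor:
    """img, txt: L2-normalized embeddings of shape [B, D]; positives on the diagonal."""
    B = img.shape[0]
    targets = torch.arange(B, device=img.device)
    loss = img.new_zeros(())
    for start in range(0, B, chunk):
        end = min(start + chunk, B)
        # Image -> text direction for this row tile: only [chunk, B] logits live at once.
        logits_i = img[start:end] @ txt.t() / temperature
        loss = loss + F.cross_entropy(logits_i, targets[start:end], reduction="sum")
        # Text -> image direction for the same rows.
        logits_t = txt[start:end] @ img.t() / temperature
        loss = loss + F.cross_entropy(logits_t, targets[start:end], reduction="sum")
    return loss / (2 * B)
```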
📚 FFPA: Yet another Faster Flash Prefill Attention with O(1) ⚡️ SRAM complexity for headdim > 256, 1.8x~3x↑ 🎉 faster than SDPA EA.
Created
2024-11-29
240 commits to main branch, last one a day ago
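A sketch of the baseline such kernels compare against: PyTorch SDPA restricted to the memory-efficient backend ("SDPA EA") on a large head dimension. The shapes, timing loop, and backend selection are illustrative assumptions, not FFPA's own API or benchmark harness.

```python
# Time the memory-efficient SDPA backend at headdim = 512 (> 256) as a rough baseline.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

B, H, S, D = 1, 8, 4096, 512  # head dimension larger than 256
q, k, v = (torch.randn(B, H, S, D, dtype=torch.float16, device="cuda") for _ in range(3))

start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):  # the "SDPA EA" code path
    F.scaled_dot_product_attention(q, k, v)        # warm-up
    start.record()
    F.scaled_dot_product_attention(q, k, v)
    end.record()
torch.cuda.synchronize()
print(f"SDPA (efficient attention) baseline: {start.elapsed_time(end):.2f} ms")
```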
Triton implementation of FlashAttention-2 that adds support for custom masks.
Created
2024-07-20
18 commits to main branch, last one 6 months ago
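An illustration of the "custom mask" idea using PyTorch's built-in SDPA, which can dispatch to fused attention kernels: pass an arbitrary boolean mask over query/key pairs. This shows the concept only and is not the Triton kernel from the repository above; the mask pattern is an arbitrary example.

```python
# Custom attention mask (causal + 256-token sliding window) passed to SDPA.
import torch
import torch.nn.functional as F

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, dtype=torch.float16, device="cuda") for _ in range(3))

idx = torch.arange(S, device="cuda")
# True = this (query, key) pair may attend; broadcast over batch and heads.
custom_mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < 256)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=custom_mask)  # [B, H, S, D]
```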
Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.
Created
2023-06-24
27 commits to master branch, last one about a year ago
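A minimal sketch of DeepSpeed pipeline-parallel training, assuming the documented `deepspeed.pipe.PipelineModule` and `engine.train_batch` APIs; the layer split, stage count, and config values are illustrative and not this repository's training script. It is meant to be launched with the `deepspeed` launcher across at least `num_stages` ranks.

```python
# Toy pipeline-parallel training step with DeepSpeed (illustrative config values).
import torch
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()

layers = [nn.Linear(1024, 1024), nn.ReLU(),
          nn.Linear(1024, 1024), nn.ReLU(),
          nn.Linear(1024, 10)]
net = PipelineModule(layers=layers, num_stages=2, loss_fn=nn.CrossEntropyLoss())

ds_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,  # micro-batches are pipelined across stages
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}
engine, _, _, _ = deepspeed.initialize(model=net,
                                       model_parameters=net.parameters(),
                                       config=ds_config)

def batches():
    # Dummy (inputs, labels) stream; replace with a real DataLoader iterator.
    while True:
        yield torch.randn(4, 1024), torch.randint(0, 10, (4,))

# One call runs forward, backward, and the optimizer step over all micro-batches.
loss = engine.train_batch(data_iter=batches())
```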
Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Created
2023-08-16
1 commit to master branch, made 5 months ago
Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference.
Created
2024-08-14
1 commit to master branch, made 3 months ago
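For reference, the decode-stage pattern such kernels optimize looks like the following unfused PyTorch computation: one new query token per sequence attends over the cached keys and values, which is memory-bound rather than compute-bound. Shapes are illustrative.

```python
# Reference (unoptimized) decode-stage MHA: single query token vs. a KV cache.
import math
import torch

B, H, D, kv_len = 4, 32, 128, 2048
q = torch.randn(B, H, 1, D, device="cuda", dtype=torch.float16)        # one new token
k_cache = torch.randn(B, H, kv_len, D, device="cuda", dtype=torch.float16)
v_cache = torch.randn(B, H, kv_len, D, device="cuda", dtype=torch.float16)

scores = q @ k_cache.transpose(-2, -1) / math.sqrt(D)   # [B, H, 1, kv_len]
probs = torch.softmax(scores.float(), dim=-1).to(q.dtype)
out = probs @ v_cache                                    # [B, H, 1, D]
```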
Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.
Created
2023-07-23
43 commits to master branch, last one 3 months ago
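A sketch of the Perceiver's core idea: a small set of learned latents cross-attends to a long input sequence. Using PyTorch's SDPA lets fused (FlashAttention-style) kernels handle the long key/value side on GPU; the module sizes and layer layout are illustrative, not this repository's implementation.

```python
# Latent cross-attention block in the Perceiver style, built on SDPA.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, num_latents: int = 64, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv = nn.Linear(dim, 2 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, N, dim], N may be large
        B = x.shape[0]
        q = self.to_q(self.latents).expand(B, -1, -1)     # [B, num_latents, dim]
        k, v = self.to_kv(x).chunk(2, dim=-1)
        def split(t):
            return t.reshape(B, -1, self.heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(B, -1, self.heads * self.head_dim)
        return self.proj(out)                              # [B, num_latents, dim]

x = torch.randn(2, 16384, 512)          # long input sequence
print(LatentCrossAttention()(x).shape)  # torch.Size([2, 64, 512])
```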