15 results found

1.5k · 18.0k · apache-2.0 · 136
The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
Created 2023-08-03
505 commits to main branch, last one 29 days ago
571 · 7.2k · apache-2.0 · 78
Phase-2 project for Chinese LLaMA-2 & Alpaca-2 large models, plus 64K ultra-long-context models (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models).
Created 2023-07-18
264 commits to main branch, last one 7 months ago
484 · 6.9k · apache-2.0 · 58
Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).
Created 2023-07-06
245 commits to main branch, last one 2 months ago
📚A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc.
Created 2023-08-27
475 commits to main branch
393 · 3.6k · gpl-3.0 · 26
📚LeetCUDA: modern CUDA learning notes with PyTorch for beginners🐑; 200+ CUDA/Tensor Cores kernels, HGEMM, FA-2 MMA, etc.🔥
Created 2022-12-17
553 commits to main branch, last one 8 hours ago
286 · 2.7k · apache-2.0 · 31
FlashInfer: Kernel Library for LLM Serving
Created 2023-07-22
1,062 commits to main branch, last one a day ago
104 · 1.8k · mit · 25
MoBA: Mixture of Block Attention for Long-Context LLMs (see the block-routing sketch after this entry).
Created 2025-02-17
13 commits to master branch, last one 23 days ago
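For orientation, here is a toy, single-head PyTorch sketch of the block-routing idea behind mixture-of-block attention: each query scores mean-pooled key blocks and attends only within its top-k blocks. The shapes and gating rule are illustrative assumptions, not MoBA's actual implementation (which is causal, batched, and kernel-fused).

```python
# Toy sketch of mixture-of-block attention routing (NOT the official MoBA
# code): each query picks top-k key/value blocks via a gate against
# mean-pooled block keys, then attends densely inside those blocks.
# Causality and batching are omitted for clarity.
import torch

S, D, Bs, topk = 256, 64, 32, 2        # seq len, head dim, block size, blocks per query
nb = S // Bs                            # number of key/value blocks

q, k, v = (torch.randn(S, D) for _ in range(3))

k_blocks = k.view(nb, Bs, D)            # (nb, Bs, D)
v_blocks = v.view(nb, Bs, D)

# Gate: score each query against the mean-pooled key of every block.
gate = q @ k_blocks.mean(dim=1).t()     # (S, nb)
sel = gate.topk(topk, dim=-1).indices   # (S, topk) block ids per query

# Gather each query's selected blocks and run dense attention inside them.
sel_k = k_blocks[sel].reshape(S, topk * Bs, D)                         # (S, topk*Bs, D)
sel_v = v_blocks[sel].reshape(S, topk * Bs, D)
scores = (q.unsqueeze(1) @ sel_k.transpose(1, 2)).squeeze(1) / D**0.5  # (S, topk*Bs)
out = (scores.softmax(-1).unsqueeze(1) @ sel_v).squeeze(1)             # (S, D)
print(out.shape)  # torch.Size([256, 64])
```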
64 · 382 · apache-2.0 · 10
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without extensive dependencies.
Created 2024-01-16
511 commits to develop branch, last one about a month ago
11 · 242 · apache-2.0 · 5
[CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss", a highly memory-efficient CLIP training scheme (see the chunked-loss sketch after this entry).
Created 2024-10-16
30 commits to main branch, last one 3 months ago
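To convey the memory idea, here is a hedged sketch: computing the image-to-text InfoNCE loss in row chunks with gradient checkpointing, so only one slice of the logits matrix is alive at any moment. This simplifies the paper's tile-wise, multi-GPU scheme and is not the official Inf-CL code; all names are illustrative.

```python
# Hedged sketch of a chunked contrastive (InfoNCE) loss: only one
# (chunk x N) slice of the logits matrix exists at a time, and gradient
# checkpointing recomputes it in backward instead of storing it.
# Only the image->text direction is shown; names are illustrative.
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def _chunk_loss(img_chunk, txt, start, temperature):
    logits = img_chunk @ txt.t() / temperature            # (chunk, N)
    target = torch.arange(start, start + img_chunk.shape[0],
                          device=img_chunk.device)        # matching pairs on diagonal
    return F.cross_entropy(logits, target, reduction="sum")

def chunked_clip_loss(img, txt, chunk=1024, temperature=0.07):
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    total = img.new_zeros(())
    for i in range(0, img.shape[0], chunk):
        total = total + checkpoint(_chunk_loss, img[i:i + chunk], txt, i,
                                   temperature, use_reentrant=False)
    return total / img.shape[0]

loss = chunked_clip_loss(torch.randn(4096, 512, requires_grad=True),
                         torch.randn(4096, 512, requires_grad=True))
loss.backward()
```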
📚FFPA (Split-D): yet another faster FlashAttention with O(1) GPU SRAM complexity for large headdim; 1.8x~3x↑🎉 faster than SDPA EA.
Created 2024-11-29
247 commits to main branch, last one about a month ago
Triton implementation of FlashAttention2 that adds custom masks (see the mask-semantics sketch after this entry).
Created 2024-07-20
18 commits to main branch, last one 8 months ago
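As background on what "custom masks" means here, a minimal eager-PyTorch sketch of the semantics such a kernel must reproduce; this is not the repo's Triton kernel, and the mask pattern is an arbitrary example.

```python
# Minimal PyTorch sketch of attention with a custom boolean mask.
# NOT the repo's Triton kernel; it only illustrates the semantics a
# custom-mask FlashAttention kernel must match.
import torch
import torch.nn.functional as F

B, H, S, D = 2, 4, 128, 64                 # batch, heads, seq len, head dim
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

# Example custom mask: causal, but every query may also see the first 4
# "sink" tokens -- a pattern a plain is_causal flag cannot express.
mask = torch.tril(torch.ones(S, S, dtype=torch.bool))
mask[:, :4] = True

# Reference: masked-out positions get -inf before the softmax.
scores = (q @ k.transpose(-2, -1)) / D**0.5
scores = scores.masked_fill(~mask, float("-inf"))
ref = scores.softmax(-1) @ v

# PyTorch's fused SDPA accepts the same boolean mask (True = attend).
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
torch.testing.assert_close(out, ref, rtol=1e-4, atol=1e-4)
```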
8 · 95 · apache-2.0 · 1
Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode; faster than ZeRO/ZeRO++/FSDP.
Created 2023-06-24
27 commits to master branch, last one about a year ago
Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference (see the decode-step sketch after this entry).
Created 2024-08-14
2 commits to master branch, last one 24 days ago
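To ground the terminology, here is a plain-PyTorch sketch of a single GQA decode step: a query of length 1 attends to the KV cache, with several query heads sharing each KV head. It is illustrative only, not this repo's CUDA implementation; MQA is the Hkv == 1 case, MHA the Hkv == Hq case.

```python
# Plain-PyTorch sketch of one GQA decode step (illustrative, not the
# repo's CUDA kernels): a single new query token per sequence attends to
# the cached keys/values, with query heads grouped over fewer KV heads.
import torch

B, Hq, Hkv, D, T = 2, 8, 2, 64, 256   # batch, query heads, KV heads, head dim, cached tokens
group = Hq // Hkv                      # query heads per KV head

q = torch.randn(B, Hq, 1, D)           # single decode-step query
k_cache = torch.randn(B, Hkv, T, D)
v_cache = torch.randn(B, Hkv, T, D)

# Expand each KV head across its query-head group.
k = k_cache.repeat_interleave(group, dim=1)   # (B, Hq, T, D)
v = v_cache.repeat_interleave(group, dim=1)

attn = (q @ k.transpose(-2, -1)) / D**0.5     # (B, Hq, 1, T)
out = attn.softmax(-1) @ v                    # (B, Hq, 1, D)
print(out.shape)  # torch.Size([2, 8, 1, 64])
```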
Performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Created 2023-08-16
1 commit to master branch, last one about a month ago
Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.
Created 2023-07-23
43 commits to master branch, last one 5 months ago