Search Results - RepositoryStats

Qwen QwenLM

1.5k

18.0k

apache-2.0

136

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

llm chinese flash-attention pretrained-models large-language-models natural-language-processing

Created 2023-08-03

505 commits to main branch, last one 29 days ago

Chinese-LLaMA-Alpaca-2 ymcui

571

7.2k

apache-2.0

78

中文LLaMA-2 & Alpaca-2大模型二期项目 + 64K超长上下文模型 (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models)

64k llm nlp rlhf yarn llama alpaca llama2 alpaca2 llama-2 alpaca-2 flash-attention large-language-models

Created 2023-07-18

264 commits to main branch, last one 7 months ago

InternLM InternLM

484

6.9k

apache-2.0

58

Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).

gpt llm rlhf chatbot chinese long-context fine-tuning-llm flash-attention pretrained-models large-language-model

Created 2023-07-06

245 commits to main branch, last one 2 months ago

Awesome-LLM-Inference xlite-dev

275

3.9k

gpl-3.0

122

📚A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism etc.

mla vllm deepseek flash-mla minimax-01 awesome-llm deepseek-r1 deepseek-v3 tensorrt-llm llm-inference flash-attention paged-attention flash-attention-3

Created 2023-08-27

475 commits to main branch

LeetCUDA xlite-dev

393

3.6k

gpl-3.0

26

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA etc.🔥

cuda hgemm cuda-demo leet-cuda learn-cuda cuda-kernel cuda-kernels cuda-toolkit flash-attention cuda-programming

Created 2022-12-17

553 commits to main branch, last one 8 hours ago

flashinfer flashinfer-ai

286

2.7k

apache-2.0

31

FlashInfer: Kernel Library for LLM Serving

gpu jit cuda pytorch llm-inference flash-attention flashinfer-python large-large-models

Created 2023-07-22

1,062 commits to main branch, last one a day ago

MoBA MoonshotAI

104

1.8k

mit

25

MoBA: Mixture of Block Attention for Long-Context LLMs

llm moe pytorch llm-serving transformer llm-training flash-attention

Created 2025-02-17

13 commits to master branch, last one 23 days ago

InternEvo InternLM

64

382

apache-2.0

10

InternEvo is an open-sourced lightweight training framework aims to support model pre-training without the need for extensive dependencies.

910b gemma llava zero3 llama3 pytorch internlm internlm2 multi-modal llm-training llm-framework ring-attention flash-attention deepspeed-ulysses tensor-parallelism transformers-models pipeline-parallelism sequence-parallelism

Created 2024-01-16

511 commits to develop branch, last one about a month ago

Inf-CLIP DAMO-NLP-SG

11

242

apache-2.0

5

[CVPR 2025 Highlight] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A super memory-efficiency CLIP training schem...

clip ring-attention flash-attention memory-efficient infinite-batch-size contrastive-learning

Created 2024-10-16

30 commits to main branch, last one 3 months ago

ffpa-attn-mma xlite-dev

7

171

gpl-3.0

3

📚FFPA(Split-D): Yet another Faster Flash Attention with O(1) GPU SRAM complexity large headdim, 1.8x~3x↑🎉 faster than SDPA EA.

mla cuda sdpa mlsys deepseek attention flash-mla fused-mla deepseek-r1 deepseek-v3 tensor-cores flash-attention

Created 2024-11-29

247 commits to main branch, last one about a month ago

flashattention2-custom-mask alexzhang13

11

109

apache-2.0

4

Triton implementation of FlashAttention2 that adds Custom Masks.

triton attention triton-lang cuda-kernels deep-learning flash-attention flash-attention-2 attention-mechanism

Created 2024-07-20

18 commits to main branch, last one 8 months ago

gdGPT CoinCheung

8

95

apache-2.0

1

Train llm (bloom, llama, baichuan2-7b, chatglm3-6b) with deepspeed pipeline mode. Faster than zero/zero++/fsdp.

llm nlp bloom llama2 pytorch pipeline deepspeed chatglm3-6b baichuan2-7b mixtral-8x7b full-finetune flash-attention model-parallization

Created 2023-06-24

27 commits to master branch, last one about a year ago

decoding_attention Bruce-Lee-LY

3

36

bsd-3-clause

2

Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.

gpu gqa llm mha mla mqa cuda nvidia flashmla cuda-core inference flashinfer flash-attention decoding-attention large-language-model multi-head-attention

Created 2024-08-14

2 commits to master branch, last one 24 days ago

flash_attention_inference Bruce-Lee-LY

5

35

bsd-3-clause

1

Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.

gpu llm mha cuda nvidia cutlass inference tensor-core flash-attention flash-attention-2 large-language-model multi-head-attention

Created 2023-08-16

1 commits to master branch, last one about a month ago

FlashPerceiver kklemon

3

26

unknown

1

Fast and memory efficient PyTorch implementation of the Perceiver with FlashAttention.

nlp perceiver transformer deep-learning flash-attention attention-mechanism

Created 2023-07-23

43 commits to master branch, last one 5 months ago