3 results found
A fast communication-overlapping library for tensor parallelism on GPUs.
Created 2024-03-01
24 commits to main branch, last one about a month ago
Examples of CUDA implementations by Cutlass CuTe
Created 2024-04-28
26 commits to main branch, last one 9 days ago
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
Created 2023-08-16
1 commit to master branch, last one 2 months ago