Statistics for topic llm-serving
RepositoryStats tracks 633,915 GitHub repositories; 39 of these are tagged with the llm-serving topic. The most common primary language for repositories using this topic is Python (20 repositories).
[Chart] Stargazers over time for topic llm-serving
Most starred repositories for topic llm-serving
A high-throughput and memory-efficient inference and serving engine for LLMs (see the usage sketch after this list)
SGLang is a fast serving framework for large language models and vision language models.
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads (see the task sketch after this list).
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
Community maintained hardware plugin for vLLM on Ascend
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
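The first description in this list is vLLM's GitHub tagline. As a minimal sketch of how such an engine is used for offline batch inference, vLLM's documented Python quickstart looks roughly like this; the model id is only an example placeholder:

```python
# Minimal offline-inference sketch using vLLM's documented Python API.
# The model id is an example; any Hugging Face model vLLM supports works.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "LLM serving engines batch requests so that",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # downloads weights on first run
outputs = llm.generate(prompts, sampling_params)  # batches internally

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The same engine can instead be launched as a standalone server, which is the mode most of the serving frameworks on this page target.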
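Ray, also listed above, is a general compute engine rather than an LLM-specific server; its core runtime schedules Python functions as distributed tasks. A minimal sketch of Ray's canonical task API (not LLM-specific):

```python
import ray

ray.init()  # starts a local Ray runtime when no cluster address is given

@ray.remote
def square(x: int) -> int:
    return x * x

# Dispatch four tasks in parallel and gather the results.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```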
Trending repositories for topic llm-serving

SGLang is a fast serving framework for large language models and vision language models.
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs (see the request sketch at the end of this list)
A high-throughput and memory-efficient inference and serving engine for LLMs
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Community maintained hardware plugin for vLLM on Ascend
From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation
High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.
Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction | A tiny BERT model can tell you the verbosity of an LLM (with low latency overhead!)
A highly optimized LLM inference acceleration engine for Llama and its variants.
Run any open-source LLMs, such as DeepSeek and Llama, as an OpenAI-compatible API endpoint in the cloud (see the client sketch after this list).
A comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) in 2024.
High-speed and easy-to-use LLM serving framework for local deployment
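The "Multi-LoRA inference server" entry above refers to serving many fine-tuned LoRA adapters on top of one shared base model, with each request naming the adapter it wants. A hedged sketch, assuming a LoRAX-style HTTP /generate endpoint; the URL, port, adapter id, and response shape are assumptions to verify against the server's documentation:

```python
# Sketch of a multi-LoRA request: the server keeps one base model loaded
# and applies the adapter named in the request parameters.
# Endpoint, port, and adapter id are illustrative assumptions based on
# LoRAX's documented /generate interface; adjust to your deployment.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",  # assumed local deployment
    json={
        "inputs": "Classify the sentiment: 'The latency was fantastic.'",
        "parameters": {
            "adapter_id": "my-org/sentiment-lora",  # hypothetical adapter
            "max_new_tokens": 32,
        },
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```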
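Several entries above expose OpenAI-compatible APIs (the "OpenAI-compatible API endpoint" entry explicitly; vLLM- and SGLang-style servers commonly do as well), so a single client script works across them. A minimal sketch using the openai Python client; the base_url, api_key convention, and model id are deployment-specific assumptions:

```python
# Querying a locally served OpenAI-compatible endpoint.
# base_url, api_key, and model are assumptions: use whatever the serving
# engine reports on startup (local servers often ignore the API key).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model id
    messages=[{"role": "user", "content": "What does an LLM serving engine do?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```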