Trending repositories for topic llm-serving
A high-throughput and memory-efficient inference and serving engine for LLMs
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
SGLang is a fast serving framework for large language models and vision language models.
SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
Run any open-source LLM, such as Llama or Gemma, as an OpenAI-compatible API endpoint in the cloud.
Superduper: build end-to-end AI applications and templates using your existing data infrastructure and tools of choice
A throughput-oriented high-performance serving framework for LLMs
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
A high-performance ML model serving framework offering dynamic batching and CPU/GPU pipelines to fully exploit your compute machine
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Multi-node production GenAI stack. Run the best of open-source AI easily on your own servers. Easily add knowledge from documents and scrape websites. Create your own AI by fine-tuning open-source models…
A ChatGPT (GPT-3.5) & GPT-4 workload trace to optimize LLM serving systems
Since the emergence of ChatGPT in 2022, accelerating large language models has become increasingly important. A list of papers on accelerating LLMs, currently focused mainly on inference…
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
A suite of hands-on training materials showing how to scale CV, NLP, and time-series forecasting workloads with Ray.
A comprehensive resource hub compiling all LLM papers accepted at the International Conference on Learning Representations (ICLR) in 2024.
A simple service that integrates vLLM with Ray Serve for fast and scalable LLM serving.
A collection of all available inference solutions for LLMs
Friendli: the fastest serving engine for generative AI
Run GPU inference and training jobs on serverless infrastructure that scales with you.
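
Many of the engines listed above advertise an OpenAI-compatible HTTP API, which is why a single client works across much of this list. A minimal client sketch, assuming a server is already running at localhost:8000 and using a placeholder model id (both are illustrative, not prescribed by any one project):

```python
# Minimal client sketch for an OpenAI-compatible serving endpoint.
# Assumes one of the engines above is already serving a model at
# http://localhost:8000/v1; the model id below is a placeholder for
# whatever the server was launched with.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "What does an LLM serving engine do?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

Switching engines then usually means changing only `base_url` and the model id, which is what makes the OpenAI protocol the de facto interchange format for this topic.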
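
The multi-LoRA entry above ("scales to 1000s of fine-tuned LLMs") hinges on one idea: a single base model stays resident on the GPU while each request names the lightweight fine-tuned adapter to apply. A hedged sketch of that request pattern follows; the endpoint path, field names, and adapter id are illustrative assumptions, not a definitive API:

```python
# Hedged sketch of per-request LoRA adapter selection, the core idea
# behind multi-LoRA serving: one resident base model serves many
# fine-tunes, and each request names the adapter to apply. The
# endpoint, JSON schema, and adapter id are assumptions for
# illustration; consult the server's docs for the real interface.
import requests

resp = requests.post(
    "http://localhost:8080/generate",  # assumed local multi-LoRA server
    json={
        "inputs": "Classify the sentiment: the new batch scheduler is great.",
        "parameters": {
            "adapter_id": "acme/sentiment-lora-v1",  # hypothetical adapter
            "max_new_tokens": 32,
        },
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])  # response field name assumed
```

Because adapters are small relative to the base model, a server can keep many of them cached and swap them per request, which is how one deployment serves thousands of fine-tunes.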