Trending repositories for topic llm-inference
A programming framework for agentic AI. Discord: https://aka.ms/autogen-dc. Roadmap: https://aka.ms/autogen-roadmap
📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
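Multi-LoRA serving works by keeping one shared base model in memory and applying a small per-request adapter on top of it. A toy sketch of the idea, assuming the standard LoRA formulation (y = x·W + α·x·A·B with low-rank A, B); all names and shapes here are illustrative, not the server's actual API:

```python
# Toy sketch of multi-LoRA serving: one shared base weight matrix,
# many small per-tenant (A, B) adapter factors applied per request.
# Pure-Python matrices (lists of rows); all names are illustrative.

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def lora_forward(x, W, adapter=None, alpha=1.0):
    """y = x.W + alpha * x.A.B  (adapter is a small (A, B) pair or None)."""
    y = matmul(x, W)
    if adapter is not None:
        A, B = adapter
        delta = matmul(matmul(x, A), B)  # low-rank update, computed on the fly
        y = add(y, [[alpha * v for v in row] for row in delta])
    return y

# Shared 2x2 base weight (identity), plus one rank-1 adapter per "tenant".
W = [[1.0, 0.0], [0.0, 1.0]]
adapters = {"tenant-a": ([[1.0], [0.0]], [[0.0, 2.0]])}  # A is 2x1, B is 1x2

x = [[3.0, 4.0]]
base_out = lora_forward(x, W)                        # no adapter: identity
lora_out = lora_forward(x, W, adapters["tenant-a"])  # adds 2*x[0] to column 1
```

Because the adapters are tiny relative to W, a server can cache thousands of them and pick one per request without duplicating the base model.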
Run any open-source LLMs, such as Llama 2, Mistral, as OpenAI compatible API endpoint in the cloud.
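"OpenAI-compatible" means the deployment accepts the same JSON request shape as api.openai.com, so existing OpenAI clients work by pointing their base URL at your server. A stdlib-only sketch of building such a request; the URL and model name are hypothetical placeholders, not values from any specific project:

```python
# Sketch of what "OpenAI-compatible" means in practice: the server accepts
# the same chat-completions JSON body as api.openai.com. The base URL and
# model name below are hypothetical placeholders.
import json
from urllib import request

BASE_URL = "http://localhost:3000/v1"  # hypothetical local deployment

def chat_request(model, user_message, temperature=0.7):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }
    return request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("mistral-7b-instruct", "Hello!")
# request.urlopen(req) would send it to the running server.
```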
Pretrain, finetune, deploy 20+ LLMs on your own data. Uses state-of-the-art techniques: flash attention, FSDP, 4-bit, LoRA, and more.
Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including x86 and ARMv9.
The easiest way to serve AI/ML models in production - Build Model Inference Service, LLM APIs, Multi-model Inference Graph/Pipelines, LLM/RAG apps, and more!
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
Gradio-based tool to run open-source LLMs directly from Hugging Face
Sequence Parallel Attention for Long-Context LLM Training and Inference
A Python package for LLM dynamic routing through the Unify REST API.
Optimized local inference for LLMs with HuggingFace-like APIs for quantization, vision/language models, multimodal agents, speech, vector DB, and RAG.
Design, conduct and analyze results of AI-powered surveys and experiments. Simulate social science and market research with large numbers of AI agents and LLMs.
An open-source agent project. Supports 6 chat platforms, one-to-many Onebotv11 connections, streaming messages, agent conversations with keyboard-bubble generation, and 6 LLM APIs (more being added). Can convert multiple LLM APIs into a unified format that carries conversation context.
An innovative library for efficient LLM inference via low-bit quantization
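The core idea behind low-bit weight quantization is to store each weight tensor as small integers plus a float scale, dequantizing on the fly. A toy symmetric int8 sketch, purely illustrative — production engines (AWQ, SmoothQuant, etc.) use per-channel or per-group scales and calibration data:

```python
# Toy symmetric int8 quantization: weights become small integers plus one
# float scale per tensor. Illustrative only; real low-bit engines use
# per-channel/group scales, calibration, and 4-bit packing.

def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # map max |w| -> 127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

w = [0.51, -1.27, 0.003, 0.9]
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))  # bounded by ~scale/2
```

The trade-off is 4x less memory (vs. float32) in exchange for a rounding error of at most about half the scale per weight.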
A multi-platform SwiftUI frontend for running local LLMs with Apple's MLX framework.
A library to communicate with ChatGPT, Claude, Copilot, Gemini, HuggingChat, and Pi
Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
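Tensor parallelism splits each weight matrix across devices so no single device holds (or computes) the whole layer. A toy sketch of the output-dimension split, where each "device" holds a shard of rows, computes its slice independently, and the slices are concatenated (the other standard split sums partial products with an all-reduce). Pure Python, with devices simulated as plain lists:

```python
# Toy tensor-parallel matvec: split W's rows across two "devices", each
# computes its output slice independently, then concatenate. Same result
# as the full matvec, but each device holds only half the weights (RAM).

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def split_rows(W, parts):
    step = len(W) // parts
    return [W[k * step:(k + 1) * step] for k in range(parts)]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, -1]

full = matvec(W, x)                              # single-device reference
shards = split_rows(W, 2)                        # each "device" holds half
pieces = [matvec(shard, x) for shard in shards]  # computed independently
parallel = pieces[0] + pieces[1]                 # "all-gather": concatenate
```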
Since the emergence of ChatGPT in 2022, accelerating large language models has become increasingly important. Here is a list of papers on accelerating LLMs, currently focused mainly on inference.
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
LLM-PowerHouse: Unleash LLMs' potential through curated tutorials, best practices, and ready-to-use code for custom training and inferencing.
🔮 SuperDuperDB: Bring AI to your database! Build, deploy and manage any AI application directly with your existing data infrastructure, without moving your data. Including streaming inference, scalab...
Efficient and general syntactical decoding for Large Language Models
Test and evaluate LLMs and model configurations, across all the scenarios that matter for your application
Code examples and resources for DBRX, a large language model developed by Databricks
Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local llama2 backend for Generative Agents/Apps.
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
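The common core of Medusa- and TriForce-style methods is draft-then-verify: a cheap mechanism proposes several future tokens, the target model checks them, and the longest agreeing prefix is accepted in one step. A toy greedy-acceptance sketch with stub models — in real systems the verification is a single batched forward pass, and both model functions here are illustrative stand-ins:

```python
# Toy draft-then-verify loop (the shared core of speculative decoding):
# a cheap draft proposes k tokens, the target model checks them, and the
# longest agreeing prefix is kept plus one correction. Both "models" are
# deterministic stubs; real verification is one batched forward pass.

def draft_model(ctx, k):
    # Stub draft: propose the next k tokens as last+1, last+2, ...
    return [ctx[-1] + i for i in range(1, k + 1)]

def target_next(ctx):
    # Stub target: greedy next token is last+1, except it "disagrees" at 5.
    nxt = ctx[-1] + 1
    return nxt if nxt != 5 else 50

def speculative_step(ctx, k=4):
    proposal = draft_model(ctx, k)
    accepted = []
    for tok in proposal:
        if target_next(ctx + accepted) == tok:
            accepted.append(tok)  # verified: keep the drafted token
        else:
            accepted.append(target_next(ctx + accepted))  # correction
            break                 # stop at the first disagreement
    return ctx + accepted

out = speculative_step([1], k=4)  # accepts 2, 3, 4; corrects 5 -> 50
```

One step here emits up to k+1 tokens for one target-model check per position, which is where the speed-up comes from when the draft agrees often.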
[ICML'24] EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
The official repo of Aquila2 series proposed by BAAI, including pretrained & chat large language models.
Embedding Studio is a framework which allows you transform your Vector Database into a feature-rich Search Engine.
⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
LLM.swift is a simple and readable library that allows you to interact with large language models locally with ease for macOS, iOS, watchOS, tvOS, and visionOS.
Official inference library for Mistral models
llm-inference is a platform for publishing and managing LLM inference, providing a wide range of out-of-the-box features for model deployment, such as UI, RESTful API, auto-scaling, computing resource...
[HPCA'21] SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
Minimalist web-searching app with an AI assistant that runs directly from your browser. Uses Web-LLM, Ratchet-ML, Wllama and SearXNG. Demo: https://felladrin-minisearch.hf.space