Trending repositories for topic llm-inference
Arch is an intelligent prompt gateway. Engineered with (fast) LLMs for the secure handling, robust observability, and seamless integration of prompts with your APIs - outside business logic. Built by ...
A list of software that allows searching the web with the assistance of AI.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
dstack is a lightweight, open-source alternative to Kubernetes & Slurm, simplifying AI container orchestration with multi-cloud & on-prem support. It natively supports NVIDIA, AMD, & TPU.
20+ high-performance LLMs with recipes to pretrain, fine-tune, and deploy at scale.
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Run any open-source LLM, such as Llama or Mistral, as an OpenAI-compatible API endpoint in the cloud.
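Serving a model behind the OpenAI wire protocol, as this project does, means any standard OpenAI client can talk to it by swapping the base URL. A minimal sketch using the official openai Python client; the port, API key, and model id below are placeholder assumptions, not values this project guarantees:

```python
# Minimal sketch of calling a locally hosted, OpenAI-compatible endpoint with
# the standard OpenAI client. base_url, api_key, and model are assumptions --
# substitute whatever your server actually exposes.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",   # hypothetical local endpoint
    api_key="not-needed-locally",          # many local servers ignore the key
)

response = client.chat.completions.create(
    model="mistral-7b-instruct",           # hypothetical local model id
    messages=[{"role": "user", "content": "One-line summary of llm-inference?"}],
)
print(response.choices[0].message.content)
```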
Pure C++ implementation of several models for real-time chatting on your computer (CPU)
Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
Build responsible, controlled, and transparent applications on top of LLMs!
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
☸️ Easy, advanced inference platform for large language models on Kubernetes. 🌟 Star to support our work!
A low-latency & high-throughput serving engine for LLMs
PyTorch library for cost-effective, fast and easy serving of MoE models.
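MoE serving is cheap because each token activates only a few experts. A toy numpy sketch of top-2 gating to illustrate the routing arithmetic; this is generic, not this library's implementation:

```python
# Toy sketch of top-k expert routing as used in MoE inference; purely
# illustrative, not this repository's implementation.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 2
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(x):
    logits = x @ router_w                 # per-token expert scores
    top = np.argsort(logits)[-top_k:]     # indices of the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()              # renormalize over the chosen experts
    # Only the chosen experts run -- the cost saving MoE serving exploits.
    return sum(w * (x @ experts[i]) for i, w in zip(top, weights))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)           # (16,)
```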
Run serverless GPU workloads with fast cold starts on bare-metal servers, anywhere in the world
Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and increase inference speed.
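The mechanism behind such home clusters is tensor parallelism: every node holds a slice of each weight matrix, computes a partial product, and the slices are gathered. A minimal numpy sketch of column-parallel matmul under those generic assumptions (not this project's code):

```python
# Column-parallel matrix multiply: each "device" owns a column slice of W,
# computes its shard, and the shards are concatenated. Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))        # one activation row
W = rng.standard_normal((8, 12))       # full weight matrix
n_devices = 3

shards = np.split(W, n_devices, axis=1)   # each device stores a column slice
partials = [x @ w for w in shards]        # computed independently per device
y = np.concatenate(partials, axis=1)      # an all-gather in a real cluster

assert np.allclose(y, x @ W)              # matches the single-device result
```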
PalmHill.BlazorChat is a chat application and API built with Blazor WebAssembly, SignalR, and WebAPI, featuring real-time LLM conversations, markdown support, customizable settings, and a responsive design.
[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*
Code for ACL 2024 paper "TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space"
Telegram bot for different language models. Supports system prompts and images
OpenAI-style, fast & lightweight local language model inference with documents
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
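ShadowKV and similar work attack the KV cache, the per-token key/value tensors that autoregressive decoding accumulates and re-reads at every step; at long contexts this cache, not the weights, dominates memory traffic. A bare-bones numpy sketch of plain cached decoding (generic attention, not ShadowKV's technique):

```python
# Why long-context inference is KV-bound: each decode step appends one K/V row
# and attends over everything cached so far. Generic sketch, not ShadowKV.
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(x):
    # Append this token's key/value once; every later step re-reads them.
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)  # grows linearly with context
    scores = K @ (x @ Wq) / np.sqrt(d)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                         # softmax over cached positions
    return probs @ V

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
print(len(k_cache), out.shape)  # 5 cached entries, one per decoded token
```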
Minimalist web-searching platform with an AI assistant that runs directly from your browser. Uses WebLLM, Wllama and SearXNG. Demo: https://felladrin-minisearch.hf.space
Superduper: Build end-to-end AI applications and agent workflows on your existing data infrastructure and preferred tools - without migrating your data.
[NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models
Implementation of the paper Fast Inference from Transformers via Speculative Decoding, Leviathan et al. 2023.
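The paper's core loop: a cheap draft model proposes tokens, the target model scores them in one pass, and each proposal is accepted with probability min(1, p_target/p_draft), with a corrected residual sample on rejection, so the output distribution exactly matches the target model. A toy sketch of that accept/reject rule over hand-written distributions (illustrative numbers, not this repository's code):

```python
# Toy accept/reject rule from speculative decoding (Leviathan et al., 2023),
# shown on hand-written distributions rather than real models.
import numpy as np

rng = np.random.default_rng(0)
p_draft  = np.array([0.6, 0.3, 0.1])   # draft model's next-token distribution
p_target = np.array([0.4, 0.4, 0.2])   # target model's distribution

token = rng.choice(3, p=p_draft)        # draft proposes a token
if rng.random() < min(1.0, p_target[token] / p_draft[token]):
    accepted = token                    # keep the cheap proposal
else:
    # On rejection, resample from the normalized residual so the final
    # sample is distributed exactly according to p_target.
    residual = np.maximum(p_target - p_draft, 0.0)
    accepted = rng.choice(3, p=residual / residual.sum())
print("sampled token:", accepted)
```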
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
🌱 EcoLogits tracks the energy consumption and environmental footprint of using generative AI models through APIs.
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
Code examples and resources for DBRX, a large language model developed by Databricks
Streamlines and simplifies prompt design for both developers and non-technical users with a low-code approach.
LLM.swift is a simple and readable library that allows you to interact with large language models locally with ease for macOS, iOS, watchOS, tvOS, and visionOS.
A multi-platform SwiftUI frontend for running local LLMs with Apple's MLX framework.
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (GPUs in the future; PRs welcome).
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
SiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework.
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
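Arbitrary bit-width quantization generalizes the usual int8 recipe: map float weights to b-bit integers plus a scale, trading accuracy for memory and bandwidth. A generic symmetric round-to-nearest sketch parameterized by bit-width (an illustration of the idea, not this library's scheme):

```python
# Generic symmetric b-bit round-to-nearest weight quantization; a sketch of
# the idea, not any particular library's kernel.
import numpy as np

def quantize(w, bits):
    qmax = 2 ** (bits - 1) - 1                   # e.g. 7 for int4, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4))
for bits in (8, 4, 2):
    q, s = quantize(w, bits)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")  # error grows as bits shrink
```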
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
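A Multi-LoRA server scales because an adapter is just a low-rank delta: the base weight W is loaded once and each request adds its own x·A·B on top. A numpy sketch of that arithmetic; the adapter ids and shapes are invented for illustration:

```python
# Why one server can host thousands of fine-tunes: the base weight W is shared,
# and each adapter contributes only a small low-rank update. Illustrative sketch.
import numpy as np

rng = np.random.default_rng(0)
d, rank = 16, 4
W = rng.standard_normal((d, d))                      # shared base weight
adapters = {                                         # hypothetical adapter ids
    "customer-a": (rng.standard_normal((d, rank)) * 0.1,
                   rng.standard_normal((rank, d)) * 0.1),
    "customer-b": (rng.standard_normal((d, rank)) * 0.1,
                   rng.standard_normal((rank, d)) * 0.1),
}

def forward(x, adapter_id):
    A, B = adapters[adapter_id]
    return x @ W + (x @ A) @ B                       # base plus low-rank delta

x = rng.standard_normal(d)
print(forward(x, "customer-a")[:3])
print(forward(x, "customer-b")[:3])                  # same base, different output
```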
⚡ Build your chatbot within minutes on your favorite device, with SOTA compression techniques for LLMs and efficient LLM execution on Intel platforms ⚡
LLM (Large Language Model) Fine-Tuning
Since the emergence of ChatGPT in 2022, accelerating large language models has become increasingly important. Here is a list of papers on accelerating LLMs, currently focusing mainly on inference acceleration.
Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)
Embedding Studio is a framework that lets you transform your vector database into a feature-rich search engine.
An innovative library for efficient LLM inference via low-bit quantization
llm-inference is a platform for publishing and managing LLM inference, providing a wide range of out-of-the-box features for model deployment, such as UI, RESTful API, auto-scaling, and computing resource management.
🎤📄 An innovative tool that transforms audio or video files into text transcripts and generates concise meeting minutes. Stay organized and efficient in your meetings, and get ready for Phase 2 where...