Trending repositories for topic llm-inference
A programming framework for agentic AI 🤖 PyPI: autogen-agentchat Discord: https://aka.ms/autogen-discord Office Hour: https://aka.ms/autogen-officehour
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
A list of software that allows searching the web with the assistance of AI.
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
Arch is an intelligent gateway for agents. Engineered with (fast) LLMs for the secure handling, rich observability, and seamless integration of prompts with your APIs - all outside business logic. Bui...
Run any open-source LLM, such as Llama or Mistral, as an OpenAI-compatible API endpoint in the cloud.
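Several entries in this list expose models through an OpenAI-compatible API. As a rough sketch of what that compatibility means in practice, a standard Chat Completions client only needs a different base URL; the endpoint address, port, and model id below are placeholder assumptions, not values from any specific project.

```python
# Minimal sketch: point the standard OpenAI client at a self-hosted,
# OpenAI-compatible server. URL, key handling, and model id are assumed.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3000/v1",  # hypothetical local endpoint
    api_key="not-needed-locally",         # many local servers ignore the key
)

response = client.chat.completions.create(
    model="mistral-7b-instruct",          # placeholder model id
    messages=[{"role": "user", "content": "Explain KV caching in one sentence."}],
)
print(response.choices[0].message.content)
```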
Streamlines and simplifies prompt design for both developers and non-technical users with a low-code approach.
📖A curated list of Awesome LLM/VLM Inference Papers with codes, such as FlashAttention, PagedAttention, Parallelism, etc. 🎉🎉
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
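The reason a single server can scale to thousands of fine-tuned models is that a LoRA adapter is only a low-rank delta on top of shared base weights. The NumPy sketch below illustrates that math under assumed shapes and adapter names; it is not the server's implementation.

```python
# Sketch of multi-LoRA math: one shared base weight, many tiny adapters.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))       # base weight, loaded once
r, alpha = 8, 16                          # adapter rank and scaling (assumed)

# Each "fine-tune" stores only 2 * 512 * r parameters, not 512 * 512.
adapters = {
    name: (rng.standard_normal((512, r)), rng.standard_normal((r, 512)))
    for name in ("support-bot", "summarizer")   # hypothetical adapter names
}

x = rng.standard_normal((1, 512))         # activations for one request
for name, (B, A) in adapters.items():
    y = x @ W + (alpha / r) * (x @ B @ A) # shared path + per-request LoRA path
    print(name, y.shape)
```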
A tool for theoretical LLM performance analysis, supporting analysis of parameters, FLOPs, memory, and latency.
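For context, here is the kind of back-of-the-envelope estimate such a tool automates: roughly 2 × params FLOPs per generated token, weight memory of params × bytes per parameter, and a decode latency floor set by re-reading the weights each token. The model size and bandwidth figures below are assumptions for illustration.

```python
# Illustrative estimates only; real analysis tools account for far more detail.
params = 7e9            # assumed 7B-parameter model
bytes_per_param = 2     # fp16/bf16 weights
mem_bandwidth = 900e9   # assumed GPU memory bandwidth in bytes/s

flops_per_token = 2 * params              # ~14 GFLOPs per generated token
weight_bytes = params * bytes_per_param   # ~14 GB of weights

# Decoding is typically memory-bound: each token re-reads all weights,
# so weight traffic / bandwidth gives a simple per-token latency floor.
latency_floor = weight_bytes / mem_bandwidth
print(f"{flops_per_token / 1e9:.0f} GFLOPs/token, "
      f"{weight_bytes / 1e9:.0f} GB weights, "
      f"~{latency_floor * 1e3:.1f} ms/token floor")
```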
LLM notes covering model inference, Transformer model structure, and LLM framework code analysis.
Minimalist web-searching platform with an AI assistant that runs directly from your browser. Uses WebLLM, Wllama and SearXNG. Demo: https://felladrin-minisearch.hf.space
An open-source agent project: supports 6 chat platforms, one-to-many OneBot v11 connections, streaming messages, agent dialogue with keyboard bubble generation, and 10+ LLM APIs (continuously updated), with the ability to convert multiple LLM APIs into a common context-aware format.
Generative AI reference workflows optimized for accelerated infrastructure and microservice architecture.
Implementation of the paper Fast Inference from Transformers via Speculative Decoding, Leviathan et al. 2023.
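The heart of that paper is an accept/reject rule that lets a small draft model propose tokens while preserving the target model's output distribution exactly. The toy sketch below shows the rule for a single drafted token, with made-up distributions; it is not code from the repository.

```python
# Toy accept/reject step from speculative decoding (Leviathan et al. 2023).
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])  # target model next-token distribution (toy)
q = np.array([0.5, 0.4, 0.1])  # draft model next-token distribution (toy)

x = rng.choice(len(q), p=q)    # draft proposes token x ~ q
if rng.random() < min(1.0, p[x] / q[x]):
    token = x                  # accept the drafted token
else:
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum() # renormalized max(p - q, 0)
    token = rng.choice(len(p), p=residual)  # resample on rejection

# Either branch leaves `token` distributed exactly according to p.
print("emitted token:", token)
```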
[NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models
☸️ Easy, advanced inference platform for large language models on Kubernetes. 🌟 Star to support our work!
Pure C++ implementation of several models for real-time chatting on your computer (CPU)
Run serverless GPU workloads with fast cold starts on bare-metal servers, anywhere in the world
🌱 EcoLogits tracks the energy consumption and environmental footprint of using generative AI models through APIs.
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).
Code and data to evaluate LLMs on the ENEM, the main standardized Brazilian university admission exams.
Code examples and resources for DBRX, a large language model developed by Databricks
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
A multi-platform SwiftUI frontend for running local LLMs with Apple's MLX framework.
JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).
Inferflow is an efficient and highly configurable inference engine for large language models (LLMs).
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
SiLLM simplifies the process of training and running Large Language Models (LLMs) on Apple Silicon by leveraging the MLX framework.
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
This project collects GPU benchmarks from various cloud providers and compares them to fixed per-token costs. Use our tool for efficient LLM GPU selection and cost-effective AI models. LLM provider p...
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and increase inference speed.
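Tensor parallelism here means sharding each weight matrix across devices so every node stores and computes only a slice. The NumPy sketch below shows the column-parallel case for one linear layer, with plain arrays standing in for per-device work; it is an illustration, not the project's code.

```python
# Column-parallel linear layer: each worker holds one vertical weight slice.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))         # activations (batch=1, hidden=512)
W = rng.standard_normal((512, 2048))      # full weight matrix

n_workers = 4
shards = np.split(W, n_workers, axis=1)   # each worker stores 512 x 512

partials = [x @ w for w in shards]        # computed independently per device
y = np.concatenate(partials, axis=1)      # gather the column slices

assert np.allclose(y, x @ W)              # same result, 1/4 the RAM per node
```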
Superduper: Build end-to-end AI applications and agent workflows on your existing data infrastructure and preferred tools - without migrating your data.
The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
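The idea behind multiple decoding heads: extra heads guess the next few tokens from one hidden state, and a single base-model pass then checks how many guesses survive. The sketch below is a greatly simplified greedy version with toy stand-ins for both the heads and the base model; it shows the bookkeeping, not the real framework.

```python
# Toy, greedy sketch of multi-head drafting and one-pass verification.
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, k = 100, 64, 3
W_heads = rng.standard_normal((k, hidden, vocab))  # one extra head per offset

def base_forward(tokens):
    """Toy base model: deterministic logits per (position, token) pair."""
    return np.array([
        np.random.default_rng(i * 1009 + int(t)).standard_normal(vocab)
        for i, t in enumerate(tokens)
    ])

h = rng.standard_normal(hidden)                    # last hidden state (toy)
guesses = [int((h @ W_heads[i]).argmax()) for i in range(k)]

prefix = [1, 2, 3]
logits = base_forward(prefix + guesses)            # one verification pass
accepted = []
for i, g in enumerate(guesses):
    # logits at position len(prefix)-1+i predict the token guessed by head i
    if int(logits[len(prefix) - 1 + i].argmax()) == g:
        accepted.append(g)
    else:
        break
print("accepted", len(accepted), "of", k, "drafted tokens")
```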
Embedding Studio is a framework that allows you to transform your vector database into a feature-rich search engine.
An innovative library for efficient LLM inference via low-bit quantization
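As a reference point for what low-bit quantization does, the sketch below applies symmetric 8-bit round-to-nearest quantization to a weight tensor and measures the reconstruction error. It is the textbook baseline, not this library's more sophisticated method.

```python
# Symmetric int8 weight quantization: the baseline behind low-bit inference.
import numpy as np

w = np.random.default_rng(0).standard_normal(8).astype(np.float32)

scale = np.abs(w).max() / 127.0       # per-tensor scale (per-channel in practice)
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_hat = q.astype(np.float32) * scale  # dequantized at matmul time

print("int8 codes:", q)
print("max abs error:", float(np.abs(w - w_hat).max()))
```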
llm-inference is a platform for publishing and managing LLM inference, providing a wide range of out-of-the-box features for model deployment, such as UI, RESTful API, auto-scaling, computing resource...
Since the emergence of ChatGPT in 2022, the acceleration of Large Language Models has become increasingly important. Here is a list of papers on accelerating LLMs, currently focusing mainly on inference...
It's time for a paradigm shift! The future of software is in plain English ✨
Gradio-based tool to run open-source LLMs directly from Huggingface
LLM-PowerHouse: Unleash LLMs' potential through curated tutorials, best practices, and ready-to-use code for custom training and inferencing.