Statistics for topic llm-inference
RepositoryStats tracks 579,129 GitHub repositories; 152 of these are tagged with the llm-inference topic. The most common primary language for repositories using this topic is Python (87). Other languages include Jupyter Notebook (16) and C++ (13).
Stargazers over time for topic llm-inference
Trending repositories for topic llm-inference
dstack is an open-source alternative to Kubernetes, designed to simplify development, training, and deployment of AI across any cloud or on-prem. It supports NVIDIA, AMD, and TPU.
GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Arch is an intelligent prompt gateway. Engineered with (fast) LLMs for the secure handling, robust observability, and seamless integration of prompts with APIs - all outside business logic. Built by t...
Most starred repositories for topic llm-inference
Minimalist web-searching platform with an AI assistant that runs directly from your browser. Uses Web-LLM, Wllama and SearXNG. Demo: https://felladrin-minisearch.hf.space
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
Code examples and resources for DBRX, a large language model developed by Databricks
Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and increase inference speed. (See the tensor-parallel sketch after this list.)
Streamlines and simplifies prompt design for both developers and non-technical users with a low-code approach.
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
An innovative library for efficient LLM inference via low-bit quantization. (See the quantization sketch after this list.)
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs. (See the multi-LoRA sketch after this list.)
An acceleration library that supports arbitrary bit-width combinatorial quantization operations
A high-performance inference system for large language models, designed for production environments.
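The tensor-parallelism entry above claims that splitting work across devices divides RAM usage and speeds up inference. Here is a minimal numpy sketch of why that holds for a single linear layer: each "device" stores only a slice of the weight matrix and computes its slice of the output. This is an illustration under simplified assumptions, not any project's actual code; all names and sizes are made up.

```python
# Column-parallel linear layer: shard the weight matrix across devices
# so each holds 1/n_devices of the memory, then gather partial outputs.
import numpy as np

n_devices = 4
d_in, d_out = 512, 2048

rng = np.random.default_rng(0)
W = rng.standard_normal((d_in, d_out)).astype(np.float32)
x = rng.standard_normal((1, d_in)).astype(np.float32)

# Each shard is (d_in, d_out / n_devices): per-device RAM drops 4x here.
shards = np.split(W, n_devices, axis=1)

# Every device computes its output slice independently and in parallel...
partial_outputs = [x @ shard for shard in shards]

# ...and the slices are concatenated (an all-gather on a real cluster).
y_parallel = np.concatenate(partial_outputs, axis=1)

assert np.allclose(y_parallel, x @ W, atol=1e-4)
print("per-device weight bytes:", shards[0].nbytes, "vs full:", W.nbytes)
```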
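Several entries above mention low-bit quantization for efficient inference. As a rough sketch of the underlying idea, the following quantizes a weight matrix to the int4 range with one scale per output row, then dequantizes it back. Real libraries add bit-packing, zero points, and fused kernels; this is only a minimal illustration with made-up sizes.

```python
# Symmetric per-row int4 weight quantization: 4-bit storage is roughly
# 1/8 of fp32 (ignoring the per-row scales) with small rounding error.
import numpy as np

def quantize_int4(w: np.ndarray):
    """Map each row into the int4 range [-8, 7] using one scale per row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate fp32 weights from int4 codes and scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)

q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```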
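The multi-LoRA server entry above scales to thousands of fine-tuned models by sharing one base model and keeping only small low-rank adapters per tenant. The sketch below shows the arithmetic that makes this cheap: each adapter adds just two rank-r matrices, selected per request. Adapter names, sizes, and the scaling factor are illustrative assumptions, not the server's actual API.

```python
# Multi-LoRA serving: one shared base weight plus per-request adapters.
# Each adapter costs only 2 * d * r values, so thousands fit in memory.
import numpy as np

d, r = 1024, 8  # hidden size, LoRA rank
rng = np.random.default_rng(0)
W_base = rng.standard_normal((d, d)).astype(np.float32)  # shared weights

adapters = {
    name: (rng.standard_normal((d, r)).astype(np.float32),   # A: d x r
           rng.standard_normal((r, d)).astype(np.float32))   # B: r x d
    for name in ("customer-a", "customer-b")                 # hypothetical
}

def forward(x: np.ndarray, adapter: str, alpha: float = 16.0) -> np.ndarray:
    """Base projection plus the selected adapter's low-rank update."""
    A, B = adapters[adapter]
    return x @ W_base + (x @ A) @ B * (alpha / r)

x = rng.standard_normal((1, d)).astype(np.float32)
print(forward(x, "customer-a").shape)  # each request picks its adapter
```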