Trending repositories for topic inference
A high-throughput and memory-efficient inference and serving engine for LLMs
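For a sense of the API, here is a minimal offline-generation sketch against vLLM's documented Python entry point (the model name is only an example):

```python
# Minimal vLLM offline-inference sketch; the model name is an example
# and any Hugging Face-compatible causal LM can be substituted.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for out in llm.generate(["The capital of France is"], params):
    print(out.outputs[0].text)
```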
SGLang is a fast serving framework for large language models and vision language models.
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any...
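The "single line of code" claim refers to Xinference exposing an OpenAI-compatible endpoint, so only the client's base URL changes. A hedged sketch, assuming the default port 9997 and a placeholder model UID for one you have launched:

```python
# Only the client's base_url changes relative to stock OpenAI usage.
# The model UID below is hypothetical; use one launched via Xinference.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="my-launched-model",  # placeholder model UID
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```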
Cross-platform, customizable ML solutions for live and streaming media.
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
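As a rough illustration of the client side, a sketch using the tritonclient HTTP API; the model name and tensor names are assumptions that must match your server's config.pbtxt:

```python
# Hedged sketch of an HTTP inference request to a running Triton server.
# Requires `pip install tritonclient[http]`; "my_model", INPUT0, and
# OUTPUT0 are placeholders for your model's actual configuration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 16).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```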
A great project for campus recruitment (autumn/spring hiring) and internships! Build a high-performance deep learning inference library from scratch, supporting inference for large models such as Llama 2, as well as UNet, YOLOv5, ResNet, and more. Implement a high-performance deep learning inference library step by step.
ncnn is a high-performance neural network inference framework optimized for the mobile platform
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
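A condensed sketch of the quantize-and-save flow following the package's documented usage; the model name and calibration text are placeholders:

```python
# 4-bit GPTQ quantization sketch; the calibration data here is a toy
# single example, a real run needs a representative sample set.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)

quant_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quant_config)

examples = [tokenizer("GPTQ calibrates on sample text.", return_tensors="pt")]
model.quantize(examples)
model.save_quantized("opt-125m-4bit")
```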
🎨 The exhaustive Pattern Matching library for TypeScript, with smart type inference.
Large Language Model Text Generation Inference
Making large AI models cheaper, faster and more accessible
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
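A minimal load-and-compile sketch against the OpenVINO Python API (names follow OpenVINO 2023+; the model path and input shape are placeholders for an IR file you have exported):

```python
# Sketch: read an OpenVINO IR model and compile it for CPU.
# "model.xml" is a placeholder path; the dummy input shape is an
# assumption and must match your model.
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")         # placeholder IR file
compiled = core.compile_model(model, "CPU")

result = compiled(np.zeros((1, 3, 224, 224), dtype=np.float32))
print(list(result.values())[0].shape)
```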
cube studio is an open-source, cloud-native, one-stop machine learning / deep learning / large-model AI platform. It supports SSO login, multi-tenancy, big-data platform integration, online notebook development, drag-and-drop pipeline orchestration, multi-node multi-GPU distributed training, hyperparameter search, vGPU inference serving, edge computing, serverless, an annotation platform, automated labeling, dataset management, large-model fine-tuning, vLLM large-model inference, LLMOps, private knowledge bases, and an AI model app store, with one-click model development/inference/fine-tuning and support for...
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
📒 A small curated list of Awesome Diffusion Inference Papers, with code.
whisper-cpp-serve: real-time speech recognition and serving of OpenAI's Whisper model in C/C++
The Qualcomm® AI Hub apps are a collection of state-of-the-art machine learning models optimized for performance (latency, memory etc.) and ready to deploy on Qualcomm® devices.
The Triton backend for the ONNX Runtime.
Triton Model Analyzer is a CLI tool that helps you understand the compute and memory requirements of Triton Inference Server models.
The Qualcomm® AI Hub Models are a collection of state-of-the-art machine learning models optimized for performance (latency, memory etc.) and ready to deploy on Qualcomm® devices.
High-performance AI inference stack. Built for production. @ziglang / @openxla / MLIR / @bazelbuild
Triton Model Navigator is an inference toolkit designed for optimizing and deploying Deep Learning models with a focus on NVIDIA GPUs.
Extensible generative AI platform on Kubernetes with OpenAI-compatible APIs.
Instantly calculate the maximum size of quantized language models that can fit in your available RAM, helping you optimize your models for inference.
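The underlying arithmetic is simple: weight bytes scale with parameter count times bits per weight. A back-of-envelope sketch, where the overhead factor is my assumption rather than this repository's exact formula:

```python
# Rough estimate of a quantized model's memory footprint.
# The 1.2x overhead for activations/KV cache is an assumption.
def quantized_size_gib(n_params: float, bits: int, overhead: float = 1.2) -> float:
    return n_params * bits / 8 * overhead / 1024**3

print(f"{quantized_size_gib(7e9, 4):.1f} GiB")  # a 7B model at 4-bit: ~3.9 GiB
```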
WIP fork of Piper from NTTS: an ultra-fast and efficient text-to-speech library; cross-platform, runs on any edge device with CPU only.
PocketGroq is a powerful Python library that simplifies integration with the Groq API, offering advanced features for natural language processing, web scraping, and autonomous agent capabilities. Key ...
Implemented in pure Pascal, LightNet is an artificial intelligence neural network library inspired by Darknet and the YOLO library. It can run most Darknet models, including YOLO models, natively and self-d...
Python library for YOLO small object detection and instance segmentation
PyTorch native quantization and sparsity for training and inference
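For flavor, a weight-only int8 sketch using torchao's quantize_ API (names follow recent torchao releases and may shift between versions):

```python
# In-place weight-only int8 quantization of the Linear layers.
# API per recent torchao releases; details may vary by version.
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
quantize_(model, int8_weight_only())
print(model(torch.randn(1, 128)).shape)  # torch.Size([1, 10])
```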
Superduper: Build end-to-end AI applications and agent workflows on your existing data infrastructure and preferred tools - without migrating your data.
Minimal code and examples for running inference with the Sapiens foundation human models in PyTorch
Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Code for ACM MobiCom 2024 paper "FlexNN: Efficient and Adaptive DNN Inference on Memory-Constrained Edge Devices"
A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2K lines of code (2% of vLLM).
Neural Network-Boosted Importance Nested Sampling for Bayesian Statistics
Deep learned, NVIDIA-accelerated 3D object pose estimation
JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs welcome).
AI Productivity Tool - Free and open-source, enhancing user productivity while ensuring privacy and data security. It provides efficient and convenient AI solutions, including but not limited to: buil...
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
T-GATE: Temporally Gating Attention to Accelerate Diffusion Model for Free!
Super performant RAG pipelines for AI apps. Summarization, Retrieve/Rerank and Code Interpreters in one simple API.
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
[deprecated] AI Gateway - core infrastructure stack for building production-ready AI Applications
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Qualcomm Cloud AI SDK (Platform and Apps) enable high performance deep learning inference on Qualcomm Cloud AI platforms delivering high throughput and low latency across Computer Vision, Object Detec...
One-click toolkit for provisioning servers to deploy and serve Large Language Models (LLMs).
klmbr - a prompt pre-processing technique to break through the barrier of entropy while generating text with LLMs
Training YOLOv9 for face detection on the WIDER Face dataset
[ACL 2024]Controlled Text Generation for Large Language Model with Dynamic Attribute Graphs