Statistics for topic multimodal
RepositoryStats tracks 652,496 GitHub repositories, of which 401 are tagged with the multimodal topic. The most common primary language for repositories using this topic is Python (270). Other languages include Jupyter Notebook (43) and TypeScript (11).
Stargazers over time for topic multimodal
Most starred repositories for topic multimodal
Trending repositories for topic multimodal
Offline inference engine for art, real-time voice conversations, LLM powered chatbots and automated workflows
The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, No-code agent builder, MCP compatibility, and more.
A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon.
Open-source framework and platform for building real-time, multimodal, low-latency conversational voice AI agents. It features a workflow builder and supports C, C++, Go, Python, JavaScript, and TypeS...
AI app store powered by 24/7 desktop history. open source | 100% local | dev friendly | 24/7 screen, mic recording
[ICLR 2025] Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality
Multimodal document parser for high quality data understanding and extraction
A Practical Course on Embeddings, RAG, Multimodal Models, and Agents with Amazon Nova.
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
Pixeltable — AI Data infrastructure providing a declarative, incremental approach for multimodal workloads.
Official implementation of GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents
UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities
Open source multi-modal RAG for building AI apps over private knowledge.
R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning.
[VLDB '25] ChatTS: Understanding, Chat, Reasoning about Time Series with TS-MLLM
A configurable engine for analysing multi-lingual and multi-modal content.
A visual playground for agentic workflows: Iterate over your agents 10x faster
Janus-Series: Unified Multimodal Understanding and Generation Models
Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 500+ LLMs (Qwen3, Qwen3-MoE, Llama4, InternLM3, DeepSeek-R1, ...) and 200+ MLLMs (Qwen2.5-VL, Qwen2.5-Omni, Qwen2-Audio, Ovis2, InternVL3, Llava, GLM4v, ...
Align Anything: Training All-modality Model with Feedback
Explore the Multimodal “Aha Moment” on 2B Model
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.