118 results found

399
4.0k
mit
62
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
Created 2024-01-26
192 commits to main branch, last one 5 days ago
171
3.3k
apache-2.0
45
StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textu...
Created 2023-12-11
10 commits to main branch, last one 11 days ago
345
3.1k
apache-2.0
40
ModelScope-Agent: An agent framework connecting models in ModelScope with the world
Created 2023-08-03
475 commits to master branch, last one about a month ago
194
2.9k
apache-2.0
31
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Created 2024-09-10
13 commits to main branch, last one 4 months ago
165
2.2k
other
49
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Created 2024-08-10
128 commits to main branch, last one 10 days ago
127
2.1k
apache-2.0
33
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Created 2023-07-04
135 commits to main branch, last one 3 months ago
129
1.9k
apache-2.0
23
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
Created 2024-06-17
59 commits to main branch, last one 5 months ago
[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
Created 2024-01-22
60 commits to main branch, last one 2 months ago
76
1.0k
apache-2.0
19
A family of lightweight multimodal models.
Created 2024-01-31
114 commits to main branch, last one 4 months ago
57
868
apache-2.0
13
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
Created 2024-06-13
40 commits to main branch, last one 12 days ago
111
842
mit
12
Real-time voice-interactive digital human, supporting end-to-end voice solutions (GLM-4-Voice - THG) and cascaded solutions (ASR-LLM-TTS-THG). Appearance and voice are customizable with no training required; supports voice cloning, with first-packet latency as low as 3s.
Created 2024-10-18
41 commits to master branch, last one 16 days ago
77
768
mit
23
Speech, Language, Audio, Music Processing with Large Language Model
Created 2023-10-23
886 commits to main branch, last one about a month ago
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
Created 2023-11-07
404 commits to main branch, last one about a year ago
A collection of resources on applications of multi-modal learning in medical imaging.
Created 2022-07-13
161 commits to main branch, last one 5 days ago
27
656
mit
10
Large-Scale Visual Representation Model
Created 2023-02-15
174 commits to main branch, last one 18 hours ago
31
634
unknown
16
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
Created 2023-09-26
107 commits to main branch, last one 3 months ago
42
606
bsd-3-clause
12
[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Created 2023-06-26
122 commits to main branch, last one 2 months ago
30
519
unknown
15
NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Created 2024-03-18
78 commits to main branch, last one 5 months ago
20
500
unknown
5
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Created 2024-06-02
55 commits to main branch, last one 10 days ago
🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
Created 2023-11-17
357 commits to main branch, last one 2 days ago
25
443
unknown
10
PyTorch implementation of Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities.
Created 2024-05-20
19 commits to main branch, last one 26 days ago
19
436
apache-2.0
9
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
Created 2025-01-07
8 commits to main branch, last one 2 months ago
23
432
unknown
6
Personal Project: MPP-Qwen14B & MPP-Qwen-Next(Multimodal Pipeline Parallel based on Qwen-LM). Support [video/image/multi-image] {sft/conversations}. Don't let the poverty limit your imagination! Train...
Created 2023-10-24
135 commits to master branch, last one 27 days ago
19
386
apache-2.0
8
Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"
Created 2024-06-12
49 commits to main branch, last one 20 days ago
Research Trends in LLM-guided Multimodal Learning.
Created 2023-05-29
16 commits to main branch, last one about a year ago
Liquid: Language Models are Scalable and Unified Multi-modal Generators
Created 2024-12-12
10 commits to main branch, last one 13 days ago
31
344
unknown
12
A Gradio demo of MGIE
Created 2023-09-28
1 commit to main branch, last one about a year ago