39 results found

163
2.2k
other
48
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Created 2024-08-10
127 commits to main branch, last one about a month ago
Open Source Generative Process Automation (i.e. Generative RPA). AI-First Process Automation with Large ([Language (LLMs) / Action (LAMs) / Multimodal (LMMs)] / Visual Language (VLMs)) Models
Created 2023-04-12
939 commits to main branch, last one 29 days ago
[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Created 2024-06-06
44 commits to master branch, last one 5 months ago
81
772
apache-2.0
11
A Framework of Small-scale Large Multimodal Models
Created 2024-02-21
223 commits to main branch, last one about a month ago
LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills
Created 2023-11-07
404 commits to main branch, last one about a year ago
A collection of resources on applications of multi-modal learning in medical imaging.
Created 2022-07-13
158 commits to main branch, last one 22 days ago
18
413
apache-2.0
9
LLaVA-Mini is a unified large multimodal model (LMM) that can support the understanding of images, high-resolution images, and videos in an efficient manner.
Created 2025-01-07
8 commits to main branch, last one 2 months ago
33
403
apache-2.0
3
This repo contains evaluation code for the paper "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI"
Created 2023-11-23
147 commits to main branch, last one 11 days ago
An open-source implementation for training LLaVA-NeXT.
Created 2024-05-11
36 commits to master branch, last one 4 months ago
28
322
mit
3
[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Created 2023-11-30
50 commits to main branch, last one 6 months ago
17
300
apache-2.0
13
Open Platform for Embodied Agents
Created 2024-03-13
129 commits to main branch, last one 2 months ago
26
275
apache-2.0
8
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
Created 2024-07-20
109 commits to main branch, last one about a month ago
[ECCV 2024] ShareGPT4V: Improving Large Multi-modal Models with Better Captions
Created 2024-06-06
3 commits to master branch, last one 8 months ago
14
182
apache-2.0
3
Embed arbitrary modalities (images, audio, documents, etc) into large language models.
Created 2023-10-11
84 commits to main branch, last one 11 months ago
[NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models"
Created 2024-03-29
19 commits to main branch, last one 5 months ago
A curated list of awesome Multimodal studies.
Created 2024-04-05
74 commits to main branch, last one a day ago
The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".
Created 2024-10-09
17 commits to main branch, last one 3 months ago
6
85
apache-2.0
0
[ECCV 2024] BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Created 2023-11-20
93 commits to main branch, last one 7 months ago
2
79
unknown
5
(NeurIPS 2024) Official PyTorch implementation of LOVA3
Created 2024-05-19
41 commits to main branch, last one a day ago
[ECCV 2024] API: Attention Prompting on Image for Large Vision-Language Models
Created 2024-09-04
14 commits to master branch, last one 5 months ago
3
73
apache-2.0
1
[ICLR 2025] Reconstructive Visual Instruction Tuning
Created 2024-10-11
9 commits to master branch, last one 19 days ago
2
68
apache-2.0
8
GeoPixel: A Pixel Grounding Large Multimodal Model for Remote Sensing is specifically developed for high-resolution remote sensing image analysis, offering advanced multi-target pixel grounding capabi...
Created 2025-01-23
78 commits to main branch, last one 8 days ago
(ICLR'25) A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents
Created 2024-07-26
16 commits to main branch, last one about a month ago
Enable AI to control your PC. This repo includes the WorldGUI Benchmark and GUI-Thinker Agent Framework.
Created 2025-02-12
137 commits to main branch, last one 8 days ago
Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"
Created 2024-04-02
26 commits to main branch, last one 5 months ago
3
42
apache-2.0
1
[AAAI-25 Oral] Official Implementation of "FLAME: Learning to Navigate with Multimodal LLM in Urban Environments"
Created 2024-08-20
9 commits to main branch, last one 27 days ago
LMM solved catastrophic forgetting, AAAI2025
Created 2024-08-23
5 commits to main branch, last one 4 months ago
OpenOmni: Official implementation of Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis
Created 2025-01-11
52 commits to main branch, last one 4 days ago
4
38
unknown
3
The official repo for “TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding”.
Created 2024-04-15
20 commits to main branch, last one 5 months ago