106 results found

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Stars: 20.8k · Forks: 2.3k · Watchers: 157 · License: apache-2.0 · Created 2023-04-17 · 460 commits to main branch, last one 7 months ago
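
A quick way to try a model from this family locally is via the Hugging Face transformers port of LLaVA-1.5 rather than the repo's own serving scripts; a minimal sketch, assuming the "llava-hf/llava-1.5-7b-hf" checkpoint, a placeholder image path, and the LLaVA-1.5 prompt template.

```python
# Minimal LLaVA-1.5 inference sketch via the transformers port (assumption: this is not
# the repo's own CLI/serving stack; the checkpoint and image path are placeholders).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder: any local image
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

# Cast only the floating-point tensors (pixel values) to fp16 to match the model.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```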

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
Stars: 6.5k · Forks: 496 · Watchers: 57 · License: mit · Created 2023-11-22 · 230 commits to main branch, last one 3 hours ago

The official repo of Qwen-VL (通义千问-VL), the chat & pretrained large vision language model proposed by Alibaba Cloud.
Stars: 5.2k · Forks: 396 · Watchers: 49 · License: other · Created 2023-08-21 · 136 commits to master branch, last one 8 months ago
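
For the chat checkpoint, usage follows the trust_remote_code pattern on the Hugging Face Hub; a minimal sketch, assuming the helpers (from_list_format, model.chat) described in the repo's quick start are exposed by the checkpoint's remote code, and using a placeholder image path.

```python
# Minimal Qwen-VL-Chat sketch (assumption: from_list_format and model.chat come from the
# checkpoint's remote code and may differ across releases).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-VL-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
).eval()

# Build a mixed image + text query; "example.jpg" is a placeholder path.
query = tokenizer.from_list_format([
    {"image": "example.jpg"},
    {"text": "Describe this image."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```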

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
Stars: 3.2k · Forks: 281 · Watchers: 28 · License: apache-2.0 · Created 2024-03-26 · 35 commits to main branch, last one 7 months ago

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Stars: 2.7k · Forks: 159 · Watchers: 43 · License: apache-2.0 · Created 2023-09-26 · 409 commits to main branch, last one 3 days ago

Collection of AWESOME vision-language models for vision tasks
Stars: 2.6k · Forks: 227 · Watchers: 125 · License: unknown · Created 2023-03-30 · 89 commits to main branch, last one 18 days ago

DeepSeek-VL: Towards Real-World Vision-Language Understanding
Created 2024-03-07 · 11 commits to main branch, last one 8 months ago

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to ace any computer task through strong reasoning abilities, self-improvement, and skill curation, ...
Stars: 1.9k · Forks: 168 · Watchers: 26 · License: mit · Created 2024-03-03 · 35 commits to main branch, last one about a month ago

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
Created 2022-09-28 · 64 commits to main branch, last one 5 days ago

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Stars: 1.3k · Forks: 74 · Watchers: 16 · License: other · Created 2023-03-02 · 36 commits to main branch, last one 11 months ago

The code used to train and run inference with the ColPali architecture.
Stars: 1.3k · Forks: 111 · Watchers: 15 · License: mit · Created 2024-06-20 · 125 commits to main branch, last one 8 days ago
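
A retrieval round-trip with this codebase typically looks like the sketch below; assumptions: the colpali_engine package exposes ColPali/ColPaliProcessor plus a multi-vector (late-interaction) scoring helper, and the "vidore/colpali-v1.2" checkpoint, page images, and query are placeholders.

```python
# ColPali retrieval sketch (assumption: API names follow the colpali_engine package;
# exact signatures may vary by version).
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.2"  # placeholder checkpoint
model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

pages = [Image.open("page_1.png"), Image.open("page_2.png")]  # placeholder page images
queries = ["What was revenue in 2023?"]                        # placeholder query

with torch.no_grad():
    page_emb = model(**processor.process_images(pages).to(model.device))
    query_emb = model(**processor.process_queries(queries).to(model.device))

# Late-interaction (MaxSim) scores: one row per query, one column per page.
scores = processor.score_multi_vector(query_emb, page_emb)
print(scores)
```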

[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Stars: 897 · Forks: 43 · Watchers: 7 · License: apache-2.0 · Created 2023-11-13 · 83 commits to main branch, last one 2 months ago

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
Created 2023-11-02 · 43 commits to main branch, last one 28 days ago

[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Stars: 738 · Forks: 46 · Watchers: 13 · License: apache-2.0 · Created 2023-11-27 · 97 commits to main branch, last one 4 months ago

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Created 2023-11-02 · 3 commits to main branch, last one about a year ago

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
Stars: 598 · Forks: 53 · Watchers: 8 · License: mit · Created 2024-04-16 · 180 commits to main branch, last one 15 days ago
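
On Apple silicon the package is typically driven through a pair of top-level helpers; a minimal sketch, assuming mlx-vlm exposes load/generate roughly as in its README (the generate signature has shifted between versions, and the checkpoint name and image path are placeholders).

```python
# MLX-VLM inference sketch on Apple silicon (assumption: the load/generate helper names
# and keyword arguments follow the package README and may differ by version).
from mlx_vlm import load, generate

# Placeholder 4-bit community checkpoint from the mlx-community Hub organization.
model, processor = load("mlx-community/llava-1.5-7b-4bit")

output = generate(
    model,
    processor,
    prompt="Describe this image.",
    image="example.jpg",  # placeholder local image path
    max_tokens=100,
)
print(output)
```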

A curated list of 3D vision papers relating to the robotics domain in the era of large models (i.e., LLMs/VLMs), inspired by awesome-computer-vision; includes papers, code, and related websites.
Created 2024-08-12 · 41 commits to main branch, last one about a month ago

[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
Stars: 585 · Forks: 61 · Watchers: 36 · License: apache-2.0 · Created 2024-04-21 · 30 commits to main branch, last one 6 months ago

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
Stars: 584 · Forks: 33 · Watchers: 7 · License: apache-2.0 · Created 2024-06-13 · 32 commits to main branch, last one 25 days ago

[ICLR 2024] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists"
Stars: 535 · Forks: 50 · Watchers: 32 · License: other · Created 2023-02-04 · 128 commits to main branch, last one 7 months ago

Famous Vision Language Models and Their Architectures
Created 2024-02-15 · 231 commits to main branch, last one 3 months ago

Parsing-free RAG supported by VLMs
Stars: 505 · Forks: 36 · Watchers: 9 · License: apache-2.0 · Created 2024-10-14 · 73 commits to master branch, last one 10 days ago

Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP...
Created 2023-05-10 · 86 commits to main branch, last one 8 months ago

"Large model": train a 27M-parameter visual multimodal VLM from scratch in 3 hours; inference and training run on a personal GPU!
Stars: 443 · Forks: 46 · Watchers: 14 · License: apache-2.0 · Created 2024-09-11 · 96 commits to master branch, last one 8 days ago

The open-source Meme Search Engine. Free and built to self-host locally with Python, Ruby, and Docker.
Stars: 422 · Forks: 17 · Watchers: 3 · License: apache-2.0 · Created 2024-06-08 · 313 commits to main branch, last one 17 days ago

A curated list of awesome knowledge-driven autonomous driving (continually updated)
Created 2023-10-24 · 51 commits to main branch, last one 6 months ago

An open-source implementation for training LLaVA-NeXT.
Created 2024-05-11 · 36 commits to master branch, last one about a month ago

A minimalist yet highly performant, lightweight, lightning-fast, multisource, multimodal, and local Ingestion, Inference, and Indexing solution, built in Rust.
Created 2024-03-31 · 525 commits to main branch, last one 4 hours ago

A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
Created 2024-03-15 · 44 commits to main branch, last one a day ago