91 results found

797
7.9k
apache-2.0
46
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Created 2023-03-09
84 commits to main branch, last one 8 months ago
683
5.2k
bsd-3-clause
31
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Created 2022-01-25
64 commits to main branch, last one 2 years ago
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
Created 2022-07-08
382 commits to master branch, last one 8 months ago
203
4.8k
apache-2.0
39
Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
Created 2022-08-01
1,557 commits to mainline branch, last one 2 days ago
249
2.5k
apache-2.0
20
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Created 2022-01-29
712 commits to main branch, last one about a year ago
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
Created 2022-09-28
70 commits to main branch, last one 17 days ago
112
1.3k
cc-by-4.0
14
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for ...
Created 2023-05-18
44 commits to main branch, last one 27 days ago
67
1.0k
apache-2.0
22
[ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering
Created 2023-08-08
415 commits to main branch, last one about a month ago
71
1.0k
apache-2.0
14
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Created 2023-05-18
136 commits to main branch, last one 6 months ago
71
906
apache-2.0
18
Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
This repository has been archived
Created 2022-03-08
35 commits to main branch, last one about a year ago
61
836
unknown
9
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
Created 2024-04-26
11 commits to main branch, last one 12 months ago
56
809
apache-2.0
12
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Created 2023-11-27
97 commits to main branch, last one 9 months ago
83
800
apache-2.0
11
A Framework of Small-scale Large Multimodal Models
Created 2024-02-21
225 commits to main branch, last one a day ago
[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.
Created 2023-09-30
40 commits to main branch, last one 8 months ago
79
666
apache-2.0
6
An open-source implementation for fine-tuning Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud.
Created 2024-09-10
111 commits to master branch, last one a day ago
33
611
other
16
Official implementation of SEED-LLaMA (ICLR 2024).
Created 2023-07-15
81 commits to main branch, last one 7 months ago
This is a third-party implementation of the paper Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.
Created 2023-10-14
12 commits to main branch, last one 10 months ago
72
542
mit
5
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Created 2021-07-20
271 commits to main branch, last one 2 months ago
89
492
apache-2.0
6
CLIPort: What and Where Pathways for Robotic Manipulation
Created 2021-09-20
91 commits to master branch, last one about a year ago
Multimodal Chinese LLaMA & Alpaca large language model (VisualCLA)
Created 2023-06-16
16 commits to main branch, last one about a year ago
22
376
apache-2.0
4
🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
Created 2023-07-15
39 commits to main branch, last one 10 months ago
34
369
mit
6
METER: A Multimodal End-to-end TransformER Framework
Created 2021-11-03
20 commits to main branch, last one 2 years ago
[ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation
Created 2021-07-23
7 commits to main branch, last one 3 years ago
32
338
mit
2
[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"
Created 2022-03-10
65 commits to main branch, last one about a month ago
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Created 2023-12-02
44 commits to main branch, last one 9 months ago
31
293
apache-2.0
8
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
Created 2024-07-20
109 commits to main branch, last one 2 months ago
Tools for movie and video research
Created 2019-06-05
91 commits to master branch, last one 2 years ago
17
277
mit
16
[NIPS2023] Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Created 2023-05-29
13 commits to master branch, last one about a year ago
Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
Created 2022-10-07
12 commits to main branch, last one about a year ago