81 results found

700
6.9k
apache-2.0
42
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Created 2023-03-09
84 commits to main branch, last one 3 months ago
648
4.9k
bsd-3-clause
34
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Created 2022-01-25
64 commits to main branch, last one 2 years ago
193
4.7k
apache-2.0
38
Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
Created 2022-08-01
1,239 commits to mainline branch, last one 10 hours ago
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
Created 2022-07-08
382 commits to master branch, last one 3 months ago
248
2.4k
apache-2.0
21
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Created 2022-01-29
712 commits to main branch, last one about a year ago
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
Created 2022-09-28
63 commits to main branch, last one 11 days ago
108
1.2k
cc-by-4.0
15
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for ...
Created 2023-05-18
43 commits to main branch, last one 3 months ago
64
980
apache-2.0
14
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Created 2023-05-18
136 commits to main branch, last one about a month ago
60
884
apache-2.0
21
[ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering
Created 2023-08-08
412 commits to main branch, last one 2 months ago
72
873
apache-2.0
18
Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
Created 2022-03-08
35 commits to main branch, last one about a year ago
61
814
unknown
10
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
Created 2024-04-26
11 commits to main branch, last one 7 months ago
44
721
apache-2.0
13
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Created 2023-11-27
97 commits to main branch, last one 4 months ago
[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.
Created 2023-09-30
40 commits to main branch, last one 3 months ago
70
669
apache-2.0
11
A Framework of Small-scale Large Multimodal Models
Created 2024-02-21
219 commits to main branch, last one 2 days ago
32
584
other
15
Official implementation of SEED-LLaMA (ICLR 2024).
Created 2023-07-15
81 commits to main branch, last one 2 months ago
This is a third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".
Created 2023-10-14
12 commits to main branch, last one 5 months ago
82
463
apache-2.0
6
CLIPort: What and Where Pathways for Robotic Manipulation
Created 2021-09-20
91 commits to master branch, last one about a year ago
Multimodal Chinese LLaMA & Alpaca large language model (VisualCLA)
Created 2023-06-16
16 commits to main branch, last one about a year ago
60
411
mit
6
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Created 2021-07-20
263 commits to main branch, last one 3 months ago
32
362
mit
6
METER: A Multimodal End-to-end TransformER Framework
Created 2021-11-03
20 commits to main branch, last one 2 years ago
[ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation
Created 2021-07-23
7 commits to main branch, last one 3 years ago
23
317
apache-2.0
5
🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
Created 2023-07-15
39 commits to main branch, last one 5 months ago
26
305
mit
3
[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"
Created 2022-03-10
64 commits to main branch, last one about a year ago
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Created 2023-12-02
44 commits to main branch, last one 4 months ago
Tools for movie and video research
Created 2019-06-05
91 commits to master branch, last one 2 years ago
💐Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Created 2021-03-05
153 commits to main branch, last one 2 years ago
Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
Created 2022-10-07
12 commits to main branch, last one about a year ago
17
245
mit
18
Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Created 2023-05-29
13 commits to master branch, last one 8 months ago
Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
Created 2024-06-13
6 commits to main branch, last one 4 months ago