65 results found

557
5.3k
apache-2.0
35
Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Created 2023-03-09
78 commits to main branch, last one 8 days ago
576
4.4k
bsd-3-clause
34
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Created 2022-01-25
64 commits to main branch, last one about a year ago
178
4.2k
apache-2.0
35
Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
Created 2022-08-01
1,121 commits to mainline branch, last one 4 days ago
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
Created 2022-07-08
374 commits to master branch, last one 6 months ago
247
2.3k
apache-2.0
21
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Created 2022-01-29
712 commits to main branch, last one 9 months ago
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
Created 2022-09-28
54 commits to main branch, last one about a month ago
88
996
cc-by-4.0
14
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for ...
Created 2023-05-18
40 commits to main branch, last one 12 days ago
52
858
apache-2.0
12
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Created 2023-05-18
134 commits to main branch, last one 6 months ago
67
825
apache-2.0
18
Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
Created 2022-03-08
35 commits to main branch, last one about a year ago
46
690
unknown
10
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
Created 2024-04-26
11 commits to main branch, last one about a month ago
40
660
apache-2.0
20
DriveLM: Driving with Graph Visual Question Answering
Created 2023-08-08
395 commits to main branch, last one 18 days ago
[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.
Created 2023-09-30
39 commits to main branch, last one about a month ago
28
525
apache-2.0
10
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Created 2023-11-27
88 commits to main branch, last one 2 months ago
27
499
other
14
Official implementation of SEED-LLaMA (ICLR 2024).
Created 2023-07-15
77 commits to main branch, last one about a month ago
80
425
apache-2.0
6
CLIPort: What and Where Pathways for Robotic Manipulation
Created 2021-09-20
91 commits to master branch, last one about a year ago
Multimodal Chinese LLaMA & Alpaca large language model (VisualCLA)
Created 2023-06-16
16 commits to main branch, last one 10 months ago
27
372
apache-2.0
11
A Framework of Small-scale Large Multimodal Models
Created 2024-02-21
168 commits to main branch, last one a day ago
[ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation
Created 2021-07-23
7 commits to main branch, last one 2 years ago
44
280
mit
6
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Created 2021-07-20
259 commits to main branch, last one 2 months ago
💐Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Created 2021-03-05
153 commits to main branch, last one about a year ago
A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".
Created 2023-10-14
10 commits to main branch, last one 5 months ago
24
260
mit
4
[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"
Created 2022-03-10
64 commits to main branch, last one 7 months ago
Tools for movie and video research
Created 2019-06-05
91 commits to master branch, last one about a year ago
13
209
apache-2.0
4
🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
Created 2023-07-15
38 commits to main branch, last one about a month ago
Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
Created 2022-10-07
12 commits to main branch, last one 12 months ago
12
199
mit
18
Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Created 2023-05-29
13 commits to master branch, last one 2 months ago
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Created 2023-12-02
38 commits to main branch, last one about a month ago
4
166
unknown
4
[ICCV 2023] Official repo of "BEVBert: Multimodal Map Pre-training for Language-guided Navigation"
Created 2023-07-25
17 commits to main branch, last one 7 months ago
Code for "Learning the Best Pooling Strategy for Visual Semantic Embedding", CVPR 2021
Created 2021-01-10
72 commits to master branch, last one about a year ago