Search Results - RepositoryStats

LAVIS salesforce

980

10.1k

bsd-3-clause

97

LAVIS - A One-stop Library for Language-Vision Intelligence

salesforce deep-learning image-captioning vision-framework multimodal-datasets vision-and-language deep-learning-library multimodal-deep-learning visual-question-anwsering vision-language-pretraining vision-language-transformer

Created 2022-08-24

492 commits to main branch, last one about a month ago

Video-LLaMA DAMO-NLP-SG

265

2.9k

bsd-3-clause

33

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

blip2 llama minigpt4 multi-modal-chatgpt large-language-models cross-modal-pretraining video-language-pretraining vision-language-pretraining

Created 2023-05-06

145 commits to main branch, last one 6 months ago

DeepSeek-VL deepseek-ai

206

2.2k

mit

20

DeepSeek-VL: Towards Real-World Vision-Language Understanding

foundation-models vision-language-model vision-language-pretraining

Created 2024-03-07

11 commits to main branch, last one 8 months ago

Janus deepseek-ai

63

1.3k

mit

23

Janus-Series: Unified Multimodal Understanding and Generation Models

llm any-to-any multimodal unified-model foundation-models vision-language-pretraining

Created 2024-10-18

16 commits to main branch, last one about a month ago

Video-ChatGPT mbzuai-oryx

110

1.3k

cc-by-4.0

15

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for ...

clip gpt-4 llama llava vicuna chatbot mulit-modal video-chatboat vision-language video-conversation vision-language-pretraining

Created 2023-05-18

43 commits to main branch, last one 4 months ago

DeCLIP Sense-GVT

32

643

unknown

20

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

clip big-model zero-shot image-text multi-model self-supervised vision-language-pretraining

Created 2021-10-09

34 commits to main branch, last one 2 years ago

VALOR TXH-mercury

15

269

mit

11

[TPAMI2024] Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

audio-language-pretraining vision-language-pretraining audiovisual-language-pretraining multimodal-representation-learning

Created 2023-01-23

51 commits to master branch, last one 5 days ago

VideoGPT-plus mbzuai-oryx

15

236

cc-by-4.0

5

Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

clip gpt4 gpt4o llava llama3 vicuna chatbot multimodal phi-3-mini dual-encoder image-encoder video-chatbot video-encoder vision-language video-conversation vision-language-pretraining

Created 2024-06-13

6 commits to main branch, last one 5 months ago

ptp sail-sg

4

150

apache-2.0

8

[CVPR2023] The code for "Position-guided Text Prompt for Vision-Language Pre-training"

vlp cross-modality vision-language-pretraining

Created 2022-12-16

16 commits to main branch, last one about a year ago

RegionSpot Surrey-UP-Lab

4

121

other

1

Recognize Any Regions

zero-shot open-world auto-labeling open-vocabulary object-detection instance-segmentation vision-language-model vision-foundation-model vision-language-pretraining vision-language-foundation-model multimodal-representation-learning

Created 2023-10-30

12 commits to main branch, last one 12 days ago

FLAIR jusiro

10

93

apache-2.0

3

[MedIA'24] FLAIR: A Foundation LAnguage-Image model of the Retina for fundus image understanding.

medical-imaging foundation-models fundus-image-analysis vision-language-pretraining

Created 2023-07-17

25 commits to main branch, last one 7 months ago

Continual-CLIP vgthengane

3

86

apache-2.0

6

Official repository for "CLIP model is an Efficient Continual Learner".

clip baseline continual-learning foundational-models vision-language-pretraining

Created 2022-10-03

5 commits to master branch, last one 2 years ago

SegCLIP ArrowLuo

8

85

unknown

10

PyTorch implementation of ICML 2023 paper "SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation"

open-vocabulary transfer-learning contrastive-learning semantic-segmentation vision-language-pretraining zero-shot-semantic-segmentation open-vocabulary-semantic-segmentation

Created 2022-11-11

8 commits to main branch, last one about a year ago

Multimodality-Representation-Learning marslanm

7

69

unknown

8

This repository provides a comprehensive collection of research papers focused on multimodal representation learning, all of which have been cited and discussed in the survey just accepted https://dl....

cross-modal multimodal-pretext transformer-models multimodal-datasets multimodal-applications multimodal-deep-learning vision-language-pretraining multimodal-pre-trained-model

Created 2022-03-13

66 commits to main branch, last one about a year ago