Search Results - RepositoryStats

797

7.9k

apache-2.0

46

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"

open-world vision-language object-detection open-world-detection vision-language-transformer

Created 2023-03-09

84 commits to main branch, last one 8 months ago

BLIP salesforce

683

5.2k

bsd-3-clause

31

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

vision-language image-captioning visual-reasoning image-text-retrieval visual-question-answering vision-language-transformer vision-and-language-pre-training

Created 2022-01-25

64 commits to main branch, last one 2 years ago

Chinese-CLIP OFA-Sys

494

5.1k

mit

37

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.

nlp clip chinese pytorch multi-modal transformers coreml-models deep-learning computer-vision vision-language contrastive-loss pretrained-models image-text-retrieval multi-modal-learning vision-and-language-pre-training

Created 2022-07-08

382 commits to master branch, last one 8 months ago

marqo marqo-ai

203

4.8k

apache-2.0

39

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

Created 2022-08-01

1,557 commits to mainline branch, last one 2 days ago

OFA OFA-Sys

249

2.5k

apache-2.0

20

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

prompt chinese multimodal pretraining prompt-tuning vision-language image-captioning pretrained-models text-to-image-synthesis visual-question-answering referring-expression-comprehension

Created 2022-01-29

712 commits to main branch, last one about a year ago

AdvancedLiterateMachinery AlibabaResearch

190

1.7k

apache-2.0

41

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

Created 2022-09-28

70 commits to main branch, last one 17 days ago

Video-ChatGPT mbzuai-oryx

112

1.3k

cc-by-4.0

14

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for ...

clip gpt-4 llama llava vicuna chatbot mulit-modal video-chatboat vision-language video-conversation vision-language-pretraining

Created 2023-05-18

44 commits to main branch, last one 27 days ago

awesome-japanese-llm llm-jp

34

1.1k

apache-2.0

28

Overview of Japanese LLMs (日本語LLMまとめ)

llm llms japanese multimodal japanese-llm llm-japanese generative-ai language-model language-models vision-language generative-model foundation-models generative-models japanese-language vision-and-language large-language-model large-language-models vision-language-model japanese-language-model

Created 2023-07-09

519 commits to main branch, last one 12 days ago

DriveLM OpenDriveLab

67

1.0k

apache-2.0

22

[ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering

llm prompting vision-language chain-of-thought tree-of-thoughts graph-of-thoughts autonomous-driving prompt-engineering large-language-models

Created 2023-08-08

415 commits to main branch, last one about a month ago

ONE-PEACE OFA-Sys

71

1.0k

apache-2.0

14

A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

multimodal audio-language vision-language contrastive-loss foundation-models vision-transformer vision-and-language representation-learning

Created 2023-05-18

136 commits to main branch, last one 6 months ago

pix2seq google-research

71

906

apache-2.0

18

Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)

pix2seq tensorflow2 deep-learning computer-vision vision-language object-detection

This repository has been archived

Created 2022-03-08

35 commits to main branch, last one about a year ago

LLaVA-pp mbzuai-oryx

61

836

unknown

9

🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)

llm lmms phi3 llava llama3 llava-phi3 phi3-llava phi-3-llava phi3-vision conversation llama3-llava llava-llama3 phi-3-vision llama-3-llava llama3-vision llama-3-vision vision-language

Created 2024-04-26

11 commits to main branch, last one 12 months ago

AlphaCLIP SunzeY

56

809

apache-2.0

12

[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

deep-learning vision-language machine-learning vision-transformer vision-and-language vision-language-model

Created 2023-11-27

97 commits to main branch, last one 9 months ago

TinyLLaVA_Factory TinyLLaVA

83

800

apache-2.0

11

A Framework of Small-scale Large Multimodal Models

nlp llama llava tinyllama transformers vision-language large-multimodal-models

Created 2024-02-21

225 commits to main branch, last one a day ago

daclip-uir Algolzw

40

754

mit

9

[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.

prompt pytorch deep-learning image-dehazing shadow-removal face-inpainting image-denoising image-deraining image-desnowing vision-language diffusion-models image-deblurring low-level-vision image-restoration jpeg-artifacts-removal low-light-image-enhancement

Created 2023-09-30

40 commits to main branch, last one 8 months ago

Qwen2-VL-Finetune 2U1

79

666

apache-2.0

6

An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud.

chatbot qwen2-5 qwen2-vl multimodal vision-language vision-language-model

Created 2024-09-10

111 commits to master branch, last one a day ago

SEED AILab-CVC

33

611

other

16

Official implementation of SEED-LLaMA (ICLR 2024).

multimodal vision-language foundation-model

Created 2023-07-15

81 commits to main branch, last one 7 months ago

Open-GroundingDino longzw1997

109

584

mit

6

A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".

open-world vision-language object-detection open-world-detection

Created 2023-10-14

12 commits to main branch, last one 10 months ago

calvin mees

72

542

mit

5

CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

vision pytorch robotics grounding manipulation deep-learning computer-vision vision-language vision-and-language natural-language-processing

Created 2021-07-20

271 commits to main branch, last one 2 months ago

cliport cliport

89

492

apache-2.0

6

CLIPort: What and Where Pathways for Robotic Manipulation

clip vision pytorch robotics grounding manipulation deep-learning rearrangement computer-vision vision-language natural-language-processing

Created 2021-09-20

91 commits to master branch, last one about a year ago

Visual-Chinese-LLaMA-Alpaca airaria

37

445

apache-2.0

9

Multimodal Chinese LLaMA & Alpaca large language model (VisualCLA)

llm nlp lora llama alpaca chinese multimodal vision-language

Created 2023-06-16

16 commits to main branch, last one about a year ago

RemoteCLIP ChenDelong1999

22

376

apache-2.0

4

🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)

remote-sensing vision-language contrastive-language-image-pretraining

Created 2023-07-15

39 commits to main branch, last one 10 months ago

METER zdou0830

34

369

mit

6

METER: A Multimodal End-to-end TransformER Framework

vision-language

Created 2021-11-03

20 commits to main branch, last one 2 years ago

Vision-Language-Transformer henghuiding

23

353

mit

4

[ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation

keras tpami iccv2021 tensorflow transformer vision-language referring-segmentation vision-language-transformer

Created 2021-07-23

7 commits to main branch, last one 3 years ago

LViT HUANGLIZI

32

338

mit

2

[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"

pytorch segmentation vision-language multimodal-learning medical-image-analysis

Created 2022-03-10

65 commits to main branch, last one about a month ago

ViP-LLaVA WisconsinAIVision

23

319

apache-2.0

6

[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

clip gpt-4 llama llava llama2 chatbot cvpr2024 multi-modal gpt-4-vision vision-language visual-prompting foundation-models

Created 2023-12-02

44 commits to main branch, last one 9 months ago

lmms-finetune zjysteven

31

293

apache-2.0

8

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.