65 results found
- Filter by Primary Language:
- Python (46)
- Jupyter Notebook (9)
- C++ (2)
- HTML (1)
Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Created
2023-03-09
78 commits to main branch, last one 8 days ago
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Created
2022-01-25
64 commits to main branch, last one about a year ago
Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai
Created
2022-08-01
1,121 commits to mainline branch, last one 4 days ago
Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.
Created
2022-07-08
374 commits to master branch, last one 6 months ago
Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Created
2022-01-29
712 commits to main branch, last one 9 months ago
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
Topics: ocr, document, documentai, multimodal, end-to-end-ocr, text-detection, computer-vision, vision-language, text-recognition, document-analysis, document-recognition, scene-text-detection, document-intelligence, vision-language-model, document-understanding, scene-text-recognition, artificial-intelligence, multimodal-deep-learning, vision-language-transformer, scene-text-detection-recognition
Created
2022-09-28
54 commits to main branch, last one about a month ago
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for ...
Created
2023-05-18
40 commits to main branch, last one 12 days ago
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Created
2023-05-18
134 commits to main branch, last one 6 months ago
Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)
Created
2022-03-08
35 commits to main branch, last one about a year ago
Overview of Japanese LLMs (日本語LLMまとめ)
Created
2023-07-09
359 commits to main branch, last one 7 days ago
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
Created
2024-04-26
11 commits to main branch, last one about a month ago
DriveLM: Driving with Graph Visual Question Answering
Created
2023-08-08
395 commits to main branch, last one 18 days ago
[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.
Created
2023-09-30
39 commits to main branch, last one about a month ago
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Created
2023-11-27
88 commits to main branch, last one 2 months ago
Official implementation of SEED-LLaMA (ICLR 2024).
Created
2023-07-15
77 commits to main branch, last one about a month ago
CLIPort: What and Where Pathways for Robotic Manipulation
Created
2021-09-20
91 commits to master branch, last one about a year ago
Multimodal Chinese LLaMA & Alpaca large language model (VisualCLA)
Created
2023-06-16
16 commits to main branch, last one 10 months ago
A Framework of Small-scale Large Multimodal Models
Created
2024-02-21
168 commits to main branch, last one a day ago
[ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation
Created
2021-07-23
7 commits to main branch, last one 2 years ago
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
Created
2021-07-20
259 commits to main branch, last one 2 months ago
💐Kaleido-BERT: Vision-Language Pre-training on Fashion Domain
Created
2021-03-05
153 commits to main branch, last one about a year ago
A third-party implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection".
Created
2023-10-14
10 commits to main branch, last one 5 months ago
[IEEE Transactions on Medical Imaging/TMI] This repo is the official implementation of "LViT: Language meets Vision Transformer in Medical Image Segmentation"
Created
2022-03-10
64 commits to main branch, last one 7 months ago
Tools for movie and video research
Created
2019-06-05
91 commits to master branch, last one about a year ago
🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
Created
2023-07-15
38 commits to main branch, last one about a month ago
Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
Created
2022-10-07
12 commits to main branch, last one 12 months ago
Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Created
2023-05-29
13 commits to master branch, last one 2 months ago
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Created
2023-12-02
38 commits to main branch, last one about a month ago
[ICCV 2023] Official repo of "BEVBert: Multimodal Map Pre-training for Language-guided Navigation"
Created
2023-07-25
17 commits to main branch, last one 7 months ago
Code for "Learning the Best Pooling Strategy for Visual Semantic Embedding", CVPR 2021
Created
2021-01-10
72 commits to master branch, last one about a year ago