106 results found
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Created
2023-04-17
460 commits to main branch, last one 7 months ago
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
Created
2023-11-22
230 commits to main branch, last one 3 hours ago
The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.
Created
2023-08-21
136 commits to master branch, last one 8 months ago
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
Created
2024-03-26
35 commits to main branch, last one 7 months ago
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Created
2023-09-26
409 commits to main branch, last one 3 days ago
Collection of AWESOME vision-language models for vision tasks
Created
2023-03-30
89 commits to main branch, last one 18 days ago
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Created
2024-03-07
11 commits to main branch, last one 8 months ago
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents to ace any computer task by enabling strong reasoning abilities, self-improvement, and skill curation, ...
Created
2024-03-03
35 commits to main branch, last one about a month ago
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
ocr, document, documentai, multimodal, end-to-end-ocr, text-detection, computer-vision, vision-language, text-recognition, document-analysis, document-recognition, scene-text-detection, document-intelligence, vision-language-model, document-understanding, scene-text-recognition, artificial-intelligence, multimodal-deep-learning, vision-language-transformer, scene-text-detection-recognition
Created
2022-09-28
64 commits to main branch, last one 5 days ago
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Created
2023-03-02
36 commits to main branch, last one 11 months ago
The code used to train and run inference with the ColPali architecture.
Created
2024-06-20
125 commits to main branch, last one 8 days ago
日本語LLMまとめ - Overview of Japanese LLMs
Created
2023-07-09
486 commits to main branch, last one 5 days ago
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Created
2023-11-13
83 commits to main branch, last one 2 months ago
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
Created
2023-11-02
43 commits to main branch, last one 28 days ago
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Created
2023-11-27
97 commits to main branch, last one 4 months ago
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Created
2023-11-02
3 commits to main branch, last one about a year ago
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
Created
2024-04-16
180 commits to main branch, last one 15 days ago
A curated list of 3D vision papers relating to the robotics domain in the era of large models (i.e., LLMs/VLMs), inspired by awesome-computer-vision, including papers, code, and related websites
Created
2024-08-12
41 commits to main branch, last one about a month ago
[ECCV 2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
Created
2024-04-21
30 commits to main branch, last one 6 months ago
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
Created
2024-06-13
32 commits to main branch, last one 25 days ago
[ICLR 2024] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists"
Created
2023-02-04
128 commits to main branch, last one 7 months ago
Famous Vision Language Models and Their Architectures
Created
2024-02-15
231 commits to main branch, last one 3 months ago
Parsing-free RAG supported by VLMs
Created
2024-10-14
73 commits to master branch, last one 10 days ago
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP...
Created
2023-05-10
86 commits to main branch, last one 8 months ago
[Large models] Train a 27M-parameter visual multimodal VLM from scratch in 3 hours; inference and training both run on a personal GPU!
Created
2024-09-11
96 commits to master branch, last one 8 days ago
The open source Meme Search Engine. Free and built to self-host locally with Python, Ruby, and Docker.
Created
2024-06-08
313 commits to main branch, last one 17 days ago
A curated list of awesome knowledge-driven autonomous driving (continually updated)
Created
2023-10-24
51 commits to main branch, last one 6 months ago
An open-source implementation for training LLaVA-NeXT.
Created
2024-05-11
36 commits to master branch, last one about a month ago
A minimalist yet highly performant, lightweight, lightning fast, multisource, multimodal and local Ingestion, Inference and Indexing solution, built in Rust.
Created
2024-03-31
525 commits to main branch, last one 4 hours ago
A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
Created
2024-03-15
44 commits to main branch, last one a day ago