120 results found
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Created
2023-04-17
460 commits to main branch, last one 8 months ago
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.
Created
2023-11-22
234 commits to main branch, last one about a month ago
The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.
Created
2023-08-21
136 commits to master branch, last one 9 months ago
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
Created
2024-03-26
35 commits to main branch, last one 9 months ago
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Created
2024-03-07
11 commits to main branch, last one 9 months ago
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Created
2023-09-26
416 commits to main branch, last one 9 days ago
Collection of AWESOME vision-language models for vision tasks
Created
2023-03-30
89 commits to main branch, last one about a month ago
The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents to ace any computer task by enabling strong reasoning abilities, self-improvement, and skill curation, ...
Created
2024-03-03
35 commits to main branch, last one 2 months ago
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
Topics: ocr, document, documentai, multimodal, end-to-end-ocr, text-detection, computer-vision, vision-language, text-recognition, document-analysis, document-recognition, scene-text-detection, document-intelligence, vision-language-model, document-understanding, scene-text-recognition, artificial-intelligence, multimodal-deep-learning, vision-language-transformer, scene-text-detection-recognition
Created
2022-09-28
69 commits to main branch, last one about a month ago
The code used to train and run inference with the ColPali architecture.
Created
2024-06-20
134 commits to main branch, last one 2 days ago
The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".
Created
2023-03-02
36 commits to main branch, last one about a year ago
Overview of Japanese LLMs (日本語LLMまとめ)
Created
2023-07-09
498 commits to main branch, last one 4 days ago
Align Anything: Training All-modality Model with Feedback
Created
2024-07-14
89 commits to main branch, last one 7 days ago
This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.
Created
2024-12-20
6 commits to master branch, last one 8 days ago
[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Created
2023-11-13
83 commits to main branch, last one 3 months ago
Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.
Created
2024-10-31
218 commits to main branch, last one a day ago
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.
Created
2023-11-02
43 commits to main branch, last one 2 months ago
🚀 "Large Model": Train a 27M-parameter visual multimodal VLM from scratch in just 3 hours!
Created
2024-09-11
96 commits to master branch, last one about a month ago
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Created
2023-11-27
97 commits to main branch, last one 6 months ago
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
Created
2024-04-16
197 commits to main branch, last one 2 days ago
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Created
2023-11-02
3 commits to main branch, last one about a year ago
A curated list of 3D vision papers related to the robotics domain in the era of large models (i.e., LLMs/VLMs), inspired by awesome-computer-vision, including papers, code, and related websites
Created
2024-08-12
41 commits to main branch, last one 2 months ago
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
Created
2024-06-13
32 commits to main branch, last one 2 months ago
Famous Vision Language Models and Their Architectures
Created
2024-02-15
231 commits to main branch, last one 4 months ago
Parsing-free RAG supported by VLMs
Created
2024-10-14
117 commits to master branch, last one 12 days ago
[ECCV 2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
Created
2024-04-21
30 commits to main branch, last one 7 months ago
Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP...
Created
2023-05-10
86 commits to main branch, last one 9 months ago
[ICLR 2024] Official Codebase for "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists"
Created
2023-02-04
128 commits to main branch, last one 9 months ago
The open source Meme Search Engine and Finder. Free and built to self-host locally with Python, Ruby, and Docker.
Created
2024-06-08
315 commits to main branch, last one 8 days ago
A curated list of awesome knowledge-driven autonomous driving (continually updated)
Created
2023-10-24
51 commits to main branch, last one 7 months ago