#vision-language-model

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Python
23657
1 year ago

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o-level performance.

Python
9285
13 days ago

The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.

Python
6275
1 year ago

🚀 [Large models] Train a 26M-parameter multimodal VLM from scratch in just 1 hour!

Python
4781
5 months ago

Align Anything: Training All-modality Model with Feedback

Jupyter Notebook
4556
1 month ago

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Python
3966
1 year ago

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

Python
3322
1 year ago

The official repo of MiniMax-Text-01 and MiniMax-VL-01, a large language model and a vision-language model based on Linear Attention

Python
3156
3 months ago

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle helps agents ace any computer task by enabling strong reasoning abilities, self-improvement, and skill curation in a standardized general environment with minimal requirements.

Python
2295
1 year ago

The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.

Python
2235
2 days ago

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.

Python
1671
1 day ago

[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.

Python
1497
4 months ago

Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.

Jupyter Notebook
1448
4 months ago

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

Python
1364
13 days ago

[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning

Python
1351
3 months ago

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

Python
1309
2 years ago