vision-language-model

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

Python
22263
8 months ago

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.

Python
7628
2 days ago

The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.

Python
5795
8 months ago

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Python
3787
1 year ago

Align Anything: Training All-modality Model with Feedback

Jupyter Notebook
3405
4 days ago

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

Python
3270
1 year ago

🚀 Train a 26M-parameter vision multimodal VLM from scratch in just 1 hour!

Python
2705
9 days ago

2670
1 month ago

The official repo of MiniMax-Text-01 and MiniMax-VL-01, large-language-model & vision-language-model based on Linear Attention

Python
2523
9 days ago

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents to ace any computer task by enabling strong reasoning abilities, self-improvement, and skill curation, in a standardized general environment with minimal requirements.

Python
2073
5 months ago

The code used to train and run inference with the ColVision models, e.g. ColPali, ColQwen2, and ColSmol.

Python
1742
21 hours ago

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

Python
1310
1 year ago

[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.

Python
1189
1 month ago

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.

Python
1185
2 hours ago

This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.

Jupyter Notebook
1062
3 months ago

[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Python
932
6 months ago

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

Python
891
25 days ago