large-multimodal-models

✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Python · 2389 stars · updated 5 months ago
OpenAdaptAI/OpenAdapt

Open Source Generative Process Automation (i.e., Generative RPA): AI-first process automation with large Language (LLMs), Action (LAMs), Multimodal (LMMs), and Visual Language (VLMs) models

Python · 1360 stars · updated 5 months ago

[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning

Python · 1310 stars · updated 2 months ago

[NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving Video Understanding and Generation with Better Captions"

Python · 1076 stars · updated 10 months ago

LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

Python · 757 stars · updated 2 years ago

LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports understanding of images, high-resolution images, and videos.

Python · 518 stars · updated 2 months ago

[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Python · 358 stars · updated 1 year ago

A minimal codebase for fine-tuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v, etc.

Python · 323 stars · updated 6 months ago

[NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models?"

Python · 188 stars · updated 1 year ago

Embed arbitrary modalities (images, audio, documents, etc.) into large language models.

Python · 186 stars · updated 1 year ago

Official implementation of GUI-R1: A Generalist R1-Style Vision-Language Action Model for GUI Agents

Python · 165 stars · updated 3 months ago