large-multimodal-models

✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Python · 2412 stars · updated 6 months ago

OpenAdaptAI/OpenAdapt

Open-source generative process automation (i.e., generative RPA): AI-first process automation with large language (LLM), action (LAM), multimodal (LMM), and visual language (VLM) models.

Python · 1394 stars · updated 7 months ago

[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning

Python · 1351 stars · updated 3 months ago

[NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving Video Understanding and Generation with Better Captions"

Python · 1076 stars · updated 1 year ago

LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

Python · 758 stars · updated 2 years ago

LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports understanding of images, high-resolution images, and videos.

Python · 529 stars · updated 3 months ago

[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Python · 374 stars · updated 1 year ago

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v, etc. (a loading sketch follows this entry).

Python · 337 stars · updated 7 months ago
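
As a rough illustration of the kind of setup a finetuning codebase like the one above automates, here is a minimal sketch using the public Hugging Face transformers API, not this repository's own code; the checkpoint ID and the freezing recipe are assumptions for illustration:

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Hypothetical checkpoint ID for illustration; not taken from the repo above.
model_id = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
)

# A common low-cost recipe (an assumption, not necessarily the repo's default):
# freeze the vision encoder and train only the projector and language model.
for name, param in model.named_parameters():
    if "vision_tower" in name:
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```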

[NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models?"

Python · 196 stars · updated 1 year ago

Embed arbitrary modalities (images, audio, documents, etc.) into large language models (a sketch of the underlying projector pattern follows this entry).

Python · 187 stars · updated 2 years ago
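
Conceptually, projects like the one above encode a modality into feature vectors and then map them into the LLM's token-embedding space so the LLM can consume them as "soft tokens". Below is a minimal, self-contained sketch of that projector pattern; the dimensions and two-layer MLP architecture are illustrative assumptions, not this repository's design:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Sketch of the common 'projector' pattern behind embedding non-text
    modalities into an LLM: take feature vectors from a modality encoder and
    map them into the LLM's token-embedding space. Dimensions are made up."""

    def __init__(self, feature_dim: int = 768, llm_hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feature_dim, llm_hidden_dim),
            nn.GELU(),
            nn.Linear(llm_hidden_dim, llm_hidden_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_patches, feature_dim) from e.g. a vision encoder
        # returns:  (batch, num_patches, llm_hidden_dim), ready to be
        # concatenated with the text token embeddings
        return self.proj(features)

# Usage: 16 "patch" features from a hypothetical encoder -> LLM-space tokens.
feats = torch.randn(1, 16, 768)
soft_tokens = ModalityProjector()(feats)
print(soft_tokens.shape)  # torch.Size([1, 16, 4096])
```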