large-multimodal-models

✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Python · 2412 stars · updated 6 months ago

OpenAdaptAI/OpenAdapt

Open-source generative process automation (i.e., generative RPA): AI-first process automation with large language (LLM), action (LAM), multimodal (LMM), and visual language (VLM) models.

Python · 1394 stars · updated 7 months ago

[ICCV 2025] Implementation for Describe Anything: Detailed Localized Image and Video Captioning

Python · 1351 stars · updated 3 months ago

[NeurIPS 2024] An official implementation of "ShareGPT4Video: Improving Video Understanding and Generation with Better Captions"

Python · 1076 stars · updated 1 year ago

LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

Python · 758 stars · updated 2 years ago

LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports understanding of images, high-resolution images, and videos.

Python · 529 stars · updated 3 months ago

[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Python · 374 stars · updated 1 year ago

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v, etc. (a loading sketch follows this entry).

Python · 337 stars · updated 7 months ago
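
As a rough illustration of the kind of setup a finetuning codebase like the one above automates, here is a minimal sketch using the public Hugging Face transformers API, not this repository's own code; the checkpoint ID and the freezing recipe are assumptions for illustration:

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Hypothetical checkpoint ID for illustration; not taken from the repo above.
model_id = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
)

# A common low-cost recipe (an assumption, not necessarily the repo's default):
# freeze the vision encoder and train only the projector and language model.
for name, param in model.named_parameters():
    if "vision_tower" in name:
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```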

[NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models?"

Python · 196 stars · updated 1 year ago

Embed arbitrary modalities (images, audio, documents, etc.) into large language models (a sketch of the underlying projector pattern follows this entry).

Python · 187 stars · updated 2 years ago
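
Conceptually, projects like the one above encode a modality into feature vectors and then map them into the LLM's token-embedding space so the LLM can consume them as "soft tokens". Below is a minimal, self-contained sketch of that projector pattern; the dimensions and two-layer MLP architecture are illustrative assumptions, not this repository's design:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Sketch of the common 'projector' pattern behind embedding non-text
    modalities into an LLM: take feature vectors from a modality encoder and
    map them into the LLM's token-embedding space. Dimensions are made up."""

    def __init__(self, feature_dim: int = 768, llm_hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feature_dim, llm_hidden_dim),
            nn.GELU(),
            nn.Linear(llm_hidden_dim, llm_hidden_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_patches, feature_dim) from e.g. a vision encoder
        # returns:  (batch, num_patches, llm_hidden_dim), ready to be
        # concatenated with the text token embeddings
        return self.proj(features)

# Usage: 16 "patch" features from a hypothetical encoder -> LLM-space tokens.
feats = torch.randn(1, 16, 768)
soft_tokens = ModalityProjector()(feats)
print(soft_tokens.shape)  # torch.Size([1, 16, 4096])
```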