multimodal-large-language-models

BradyFU / Awesome-Multimodal-Large-Language-Models

✨✨Latest Advances on Multimodal Large Language Models

instruction-tuning instruction-following large-vision-language-model visual-instruction-tuning multi-modality in-context-learning large-language-models large-vision-language-models multimodal-chain-of-thought multimodal-in-context-learning multimodal-large-language-models chain-of-thought

16387

1063

11 days ago

X-PLUG / MobileAgent

Mobile-Agent: The Powerful GUI Agent Family

agent mllm mobile-agents multimodal multimodal-large-language-models multimodal-agent Android App GUI mobile-automation copilot

Python

5946

576

7 days ago

joanrod / star-vector

StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.

large-language-model multimodal-large-language-models SVG vlm

Python

4042

219

6 months ago

modelscope / ms-agent

MS-Agent: Lightweight Framework for Empowering Agents with Autonomous Exploration in Complex Task Scenarios

agent gpts large-language-model qwen open-gpts multi-agents assistantapi chatbot multimodal-large-language-models rag Code data-science deep-research

Python

3483

398

2 days ago

ictnlp / LLaMA-Omni

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

large-language-models multimodal-large-language-models speech-to-text

Python

3073

216

5 months ago

VITA-MLLM / VITA

✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

large-multimodal-models multimodal-large-language-models

Python

2412

176

6 months ago

X-PLUG / mPLUG-DocOwl

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

chart-understanding document-understanding mllm multimodal multimodal-large-language-models table-understanding

Python

2249

129

4 months ago

cambrian-mllm / cambrian

Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

chatbot clip machine-vision dino instruction-tuning large-language-models large-language-model mllm multimodal-large-language-models representation-learning

Python

1954

131

1 year ago

sherlockchou86 / VideoPipe

A cross-platform video structuring (video analysis) framework. If you find it helpful, please give it a star :)

C++

1833

267

18 days ago

YangLing0818 / RPG-DiffusionMaster

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

large-language-models multimodal-large-language-models image-editting text-to-image

Jupyter Notebook

1823

100

8 months ago

ByteDance-Seed / Seed1.5-VL

Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.

cookbook large-language-model multimodal-large-language-models vision-language-model

Jupyter Notebook

1448

4 months ago

AIDC-AI / Ovis

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

chatbot llama3 multimodal multimodal-large-language-models multimodality qwen vision-language-model

Python

1364

12 days ago

Henry-23 / VideoChat

Real-time voice-interactive digital human, supporting both an end-to-end voice solution (GLM-4-Voice - THG) and a cascaded solution (ASR-LLM-TTS-THG). Appearance and voice are customizable with no training required, voice cloning is supported, and first-packet latency is as low as 3 s.

dialogue-systems real-time digital-human lip-sync musetalk streaming talking-head asr tts end-to-end multimodal-large-language-models

Python

1105

144

6 months ago