
multimodal-large-language-models

StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.

Python
3621
4 days ago

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

Python
2891
2 hours ago

✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Python
2235
23 days ago

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Python
2157
4 months ago

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

Jupyter Notebook
1793
3 months ago

Python
1014
5 months ago

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

Python
891
25 days ago

Real-time voice-interactive digital human, supporting an end-to-end voice solution (GLM-4-Voice - THG) and a cascaded solution (ASR-LLM-TTS-THG). Appearance and voice are customizable with no training required, voice cloning is supported, and first-packet latency is as low as 3s.

Python
885
1 month ago

Speech, Language, Audio, Music Processing with Large Language Model

Python
783
7 days ago

LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

Python
739
1 year ago

Python
659
3 days ago

✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models

Python
634
4 months ago

[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Python
608
3 months ago

NeurIPS 2024 Paper: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

Python
525
6 months ago

✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

525
3 days ago