multimodal-large-language-models

StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.

Python
4042
6 months ago

MS-Agent: Lightweight Framework for Empowering Agents with Autonomous Exploration in Complex Task Scenarios

Python
3483
2 days ago

LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.

Python
3073
5 months ago

✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Python
2412
6 months ago

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Python
2249
4 months ago

A cross-platform video structuring (video analysis) framework. If you find it helpful, please give it a star :)

C++
1833
18 days ago

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

Jupyter Notebook
1823
8 months ago

Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.

Jupyter Notebook
1448
4 months ago

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

Python
1364
12 days ago

Real-time voice-interactive digital human, supporting an end-to-end voice solution (GLM-4-Voice - THG) and a cascaded solution (ASR-LLM-TTS-THG). Appearance and voice are customizable with no training required, voice cloning is supported, and first-packet latency is as low as 3s.

Python
1105
6 months ago

Python
1045
1 year ago

Speech, Language, Audio, Music Processing with Large Language Model

Python
896
1 month ago

LLaVA-Plus: Large Language and Vision Assistants that Plug and Learn to Use Skills

Python
759
2 years ago

PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models

Python
748
18 days ago