mllm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model
[CVPR'25] Official Implementations for Paper - MagicQuill: An Intelligent Interactive Image Editing System
SpatialLM: Large Language Model for Spatial Understanding
Reasoning in LLMs: Papers and Resources, including Chain-of-Thought, OpenAI o1, and DeepSeek-R1 🍓
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions
Agent S: an open agentic framework that uses computers like a human
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Pioneering Multimodal Reasoning with CoT
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
A family of lightweight multimodal models.
[CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"
Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs
🚀🚀🚀 A collection of awesome public projects about Large Language Models (LLM), Vision Language Models (VLM), Vision Language Action (VLA), AI Generated Content (AIGC), and related datasets and applications.
✨✨Woodpecker: Hallucination Correction for Multimodal Large Language Models
OpenEMMA: a permissively licensed, open-source "reproduction" of Waymo's EMMA model.
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization