

lmm

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle enables agents to ace any computer task through strong reasoning abilities, self-improvement, and skill curation, in a standardized general environment with minimal requirements.

Python
2073
5 months ago

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

Python
864
5 months ago

Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs

Python
661
2 days ago

LLaVA-Interactive-Demo

Python
368
9 months ago

[CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models

Python
280
5 months ago

PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models

Python
256
1 year ago

The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".

Python
246
4 months ago

Official code for the paper "Mantis: Multi-Image Instruction Tuning" [TMLR 2024]

Python
214
1 month ago

An RLHF Infrastructure for Vision-Language Models

Python
171
5 months ago

😎 A curated list of awesome LMM hallucination papers, methods & resources.

150
1 year ago

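🤖 Discord AI assistant with OpenAI, Gemini, Claude & DeepSeek integration, multilingual support, multimodal chat, image generation, web search, and deep thinking | A powerful Discord AI assistant that integrates multiple top-tier AI models and supports multiple languages, multimodal conversation, image generation, web search, and deep thinking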

JavaScript
147
2 months ago

[ICLR 2025] What do we expect from LMMs as AIGI evaluators and how do they perform?

144
3 months ago

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

Python
113
1 year ago

🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant

Python
94
25 days ago

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI.

65
4 months ago

[CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Python
58
6 days ago

LLaVA inference with multiple images at once for cross-image analysis.

Python
50
1 year ago

[COLING 2025] Idea23D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs

Jupyter Notebook
46
3 months ago

LMM solved catastrophic forgetting, AAAI 2025

Python
40
5 days ago