
#lmm

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents in acing any computer task by enabling strong reasoning abilities, self-improvement, and skill curation in a standardized general environment with minimal requirements.

Python
2252
9 months ago

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

Python
906
16 days ago

Eagle: Frontier Vision-Language Models with Data-Centric Strategies

Python
853
12 days ago

LLaVA-Interactive-Demo

Python
377
1 year ago

[CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models

Python
297
9 months ago

The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM", IJCV 2025

Python
264
3 months ago

PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models

Python
257
16 days ago

🤖 Discord AI assistant with OpenAI, Gemini, Claude & DeepSeek integration, multilingual support, multimodal chat, image generation, web search, and deep thinking

JavaScript
233
6 months ago

Official code for Paper "Mantis: Multi-Image Instruction Tuning" [TMLR 2024]

Python
225
5 months ago

An RLHF Infrastructure for Vision-Language Models

Python
181
9 months ago

😎 A curated list of awesome LMM hallucination papers, methods & resources.

149
1 year ago

[ICLR 2025] What do we expect from LMMs as AIGI evaluators and how do they perform?

141
7 months ago

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

Python
130
1 year ago

🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant

Python
112
5 months ago

[CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Python
78
4 months ago

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI.

70
8 months ago

Official implementation of "Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology"

Python
53
1 month ago

LLaVA inference with multiple images at once for cross-image analysis.

Python
51
1 year ago

[COLING 2025] Idea23D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs

Jupyter Notebook
51
7 months ago