lmm

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents in acing any computer task by enabling strong reasoning abilities, self-improvement, and skill curation, in a standardized general environment with minimal requirements.

Python
2295
1 year ago

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

Python
918
2 months ago

Eagle: Frontier Vision-Language Models with Data-Centric Strategies

Python
876
2 months ago

LLaVA-Interactive-Demo

Python
378
1 year ago

[CVPR'24] HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models

Python
300
1 year ago

The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM", IJCV2025

Python
269
4 months ago

PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models

Python
257
2 months ago

Official code for Paper "Mantis: Multi-Image Instruction Tuning" [TMLR 2024]

Python
227
6 months ago

🤖 Discord AI assistant with OpenAI, Gemini, Claude & DeepSeek integration, multilingual support, multimodal chat, image generation, web search, and deep thinking

JavaScript
223
7 months ago

An RLHF Infrastructure for Vision-Language Models

Python
184
1 year ago

😎 curated list of awesome LMM hallucinations papers, methods & resources.

149
2 years ago

[ICLR 2025] What do we expect from LMMs as AIGI evaluators and how do they perform?

139
8 months ago

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

Python
133
1 year ago

🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant

Python
114
6 months ago

[CVPR 2025 🔥] A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Python
86
6 months ago

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI.

72
10 months ago

Official implementation of "Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology"

Python
64
3 months ago

[ICCV'25] Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

Python
56
2 months ago

LLaVA inference with multiple images at once for cross-image analysis.

Python
51
2 years ago