
vision-and-language

A one-stop repository for generative AI research updates, interview resources, notebooks, and much more.

14107 stars · updated 6 days ago

Streamlines the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL (a minimal LoRA sketch follows this entry).

Python · 2628 stars · updated 1 day ago
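
For orientation, here is a minimal sketch of what LoRA fine-tuning of such a multimodal model typically looks like with Hugging Face `transformers` and `peft`. The checkpoint id and target module names are assumptions for illustration, not this repo's documented recipe.

```python
# Hedged sketch: LoRA fine-tuning of a vision-language model with
# Hugging Face `transformers` + `peft`. Model id and target modules
# are assumptions; check the repo for its supported recipes.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

# Wrap only the attention projections in low-rank adapters; the frozen
# base weights stay untouched, so the trainable footprint is tiny.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # sanity check before training
```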

Code for ALBEF: a new vision-language pre-training method (its contrastive-loss core is sketched below).

Python · 1692 stars · updated 3 years ago
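
ALBEF's pre-training rests on an image-text contrastive (ITC) objective, alongside image-text matching, masked language modeling, and momentum distillation, which this sketch omits. A minimal version of the symmetric ITC loss:

```python
# Minimal sketch of the symmetric image-text contrastive (ITC) loss
# that ALBEF-style pre-training builds on.
import torch
import torch.nn.functional as F

def itc_loss(image_feats, text_feats, temperature=0.07):
    # Normalize both embeddings so the dot product is cosine similarity.
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = itc_loss(torch.randn(8, 256), torch.randn(8, 256))
```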

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision" (see the sketch after this entry).

Python · 1488 stars · updated 1 year ago
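
ViLT's point is architectural: no CNN backbone or region detector, just a linear patch projection whose outputs are concatenated with word embeddings and fed to a single transformer. A toy sketch with illustrative dimensions:

```python
# Sketch of ViLT's core idea: image patches are linearly embedded and
# simply concatenated with word embeddings before one shared
# transformer encoder. Dimensions are illustrative.
import torch
import torch.nn as nn

dim, vocab = 768, 30522
patch_embed = nn.Conv2d(3, dim, kernel_size=32, stride=32)  # linear patch projection
word_embed = nn.Embedding(vocab, dim)
type_embed = nn.Embedding(2, dim)  # 0 = text, 1 = image
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), num_layers=2)

image = torch.randn(1, 3, 384, 384)
tokens = torch.randint(0, vocab, (1, 40))

img_seq = patch_embed(image).flatten(2).transpose(1, 2)   # (1, 144, dim)
txt_seq = word_embed(tokens)                               # (1, 40, dim)
seq = torch.cat([txt_seq + type_embed.weight[0],
                 img_seq + type_embed.weight[1]], dim=1)
out = encoder(seq)  # one transformer handles both modalities jointly
```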

Real-time and accurate open-vocabulary end-to-end object detection (the open-vocabulary scoring step is sketched below).

Python · 1333 stars · updated 8 months ago
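
The open-vocabulary part usually comes down to matching region features against text embeddings of arbitrary class prompts, so the label set can change at inference time without retraining. A sketch with stubbed-out features; the real pipeline uses the detector's region features and a text encoder:

```python
# Sketch of the open-vocabulary scoring step behind detectors like
# this one. Feature extraction is stubbed with random tensors.
import torch
import torch.nn.functional as F

region_feats = torch.randn(100, 512)  # from the detection head (stub)
text_feats = torch.randn(3, 512)      # encoded prompts, e.g.
                                      # ["person", "red backpack", "dog"]
region = F.normalize(region_feats, dim=-1)
text = F.normalize(text_feats, dim=-1)
scores = region @ text.t()            # (100, 3) per-class similarity
labels = scores.argmax(dim=-1)        # open-vocabulary class assignment
```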

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

Python · 1309 stars · updated 2 years ago

Recent Advances in Vision and Language Pre-Trained Models (VL-PTMs)

1155 stars · updated 3 years ago

Codebase for Aria, an Open Multimodal Native MoE model.

Jupyter Notebook · 1068 stars · updated 7 months ago

A general representation model across vision, audio, and language modalities. Paper: "ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities".

Python · 1050 stars · updated 10 months ago

X-modaler is a versatile, high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).

Python · 968 stars · updated 2 years ago

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

Python · 906 stars · updated 15 days ago

[ECCV 2024 Best Paper Candidate & TPAMI 2025] PointLLM: Empowering Large Language Models to Understand Point Clouds (its projector interface is sketched below).

Python · 865 stars · updated 6 days ago
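
The PointLLM-style interface boils down to projecting point-cloud features into the LLM's token-embedding space and prepending them to the text embeddings as "point tokens". A stubbed sketch; dimensions and names are assumptions, and the real repo wires in an actual point encoder and language model:

```python
# Sketch of a PointLLM-style projector interface. Encoder and LLM are
# stubbed with random tensors; shapes are illustrative assumptions.
import torch
import torch.nn as nn

llm_dim = 4096
point_feats = torch.randn(1, 513, 384)     # stub point-encoder output
projector = nn.Linear(384, llm_dim)        # maps points into LLM space

point_tokens = projector(point_feats)      # (1, 513, llm_dim)
text_embeds = torch.randn(1, 32, llm_dim)  # embedded instruction tokens
inputs_embeds = torch.cat([point_tokens, text_embeds], dim=1)
# `inputs_embeds` would be fed to the LLM's forward pass instead of token ids.
```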

[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want (the alpha-channel idea is sketched below).

Jupyter Notebook · 836 stars · updated 1 month ago
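
Alpha-CLIP's core idea is an extra alpha-channel input that tells the visual encoder where to focus. Below is a sketch of one way to realize it: an auxiliary patch-embedding branch for the mask, added to the RGB patch embeddings. Shapes follow a ViT-B/16 at 224x224; the real repo fine-tunes a pretrained CLIP ViT rather than training this from scratch.

```python
# Sketch of the Alpha-CLIP idea: an alpha mask marks the region of
# interest, and an extra patch-embedding branch for the mask is added
# to the RGB patch embeddings so the encoder attends where the mask
# says. Illustrative, not the repo's exact implementation.
import torch
import torch.nn as nn

dim = 768
rgb_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
alpha_embed = nn.Conv2d(1, dim, kernel_size=16, stride=16)  # extra branch

image = torch.randn(1, 3, 224, 224)
alpha = torch.zeros(1, 1, 224, 224)
alpha[..., 64:160, 64:160] = 1.0                 # focus on a central region

patches = rgb_embed(image) + alpha_embed(alpha)  # (1, dim, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)      # (1, 196, dim) -> into the ViT
```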