vision-and-language

A one-stop repository for generative AI research updates, interview resources, notebooks, and much more!

11736 stars · updated 19 days ago

Streamlines the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL.

Python · 2547 stars · updated 6 days ago

Code for ALBEF, a vision-language pre-training method.

Python · 1634 stars · updated 3 years ago

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"

Python · 1451 stars · updated 1 year ago

Real-time, accurate, open-vocabulary, end-to-end object detection.

Python · 1313 stars · updated 4 months ago

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

Python · 1310 stars · updated 1 year ago

Recent Advances in Vision and Language PreTrained Models (VL-PTMs)

1152 stars · updated 3 years ago

Codebase for Aria - an Open Multimodal Native MoE

Jupyter Notebook · 1029 stars · updated 3 months ago

A general representation model across the vision, audio, and language modalities. Paper: "ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities".

Python · 1029 stars · updated 6 months ago

X-modaler is a versatile, high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).

Python · 970 stars · updated 2 years ago

[CVPR 2024] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses seamlessly integrated with object segmentation masks.

Python · 864 stars · updated 5 months ago

[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

Jupyter Notebook · 805 stars · updated 9 months ago

Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"

Python · 792 stars · updated 4 years ago