vision-language-pretraining

Janus-Series: Unified Multimodal Understanding and Generation Models

Python
17558
8 months ago

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Python
3966
1 year ago

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Python
3078
1 year ago

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.

Python
1441
2 months ago

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Python
666
3 years ago

[TPAMI2024] Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Python
303
9 months ago

Official repository of the paper "VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding"

Python
286
2 months ago

[CVPR2023] The code for "Position-guided Text Prompt for Vision-Language Pre-training"

Python
152
2 years ago

[MedIA'25] FLAIR: A Foundation LAnguage-Image model of the Retina for fundus image understanding.

Python
146
3 months ago

✨✨ [NeurIPS 2025] Official implementation of BridgeVLA

Python
137
14 days ago

Official repository for "CLIP model is an Efficient Continual Learner".

Python
101
3 years ago

PyTorch implementation of ICML 2023 paper "SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation"

Python
96
2 years ago

This repository provides a comprehensive collection of research papers on multimodal representation learning, all of which are cited and discussed in the recently accepted survey: https://dl.acm.org/doi/abs/10.1145/3617833

81
4 months ago

Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models. [ICCV 2023 Oral]

Python
63
2 years ago

📍 Official PyTorch implementation of the paper "ProtoCLIP: Prototypical Contrastive Language Image Pretraining" (IEEE TNNLS)

Python
53
2 years ago

[ICLR2024] Codes and Models for COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

Python
43
9 months ago