vision-language-pretraining

deepseek-ai / Janus

Janus-Series: Unified Multimodal Understanding and Generation Models

any-to-any foundation-models large-language-model multimodal vision-language-pretraining unified-model

Python

17558

2242

8 months ago

salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

deep-learning deep-learning-library image-captioning salesforce vision-and-language vision-framework vision-language-pretraining vision-language-transformer visual-question-anwsering multimodal-datasets multimodal-deep-learning

Jupyter Notebook

10933

1067

1 year ago

deepseek-ai / DeepSeek-VL

DeepSeek-VL: Towards Real-World Vision-Language Understanding

vision-language-model vision-language-pretraining foundation-models

Python

3966

581

1 year ago

DAMO-NLP-SG / Video-LLaMA

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

large-language-models video-language-pretraining vision-language-pretraining blip2 llama minigpt4 cross-modal-pretraining multi-modal-chatgpt

Python

3078

281

1 year ago

mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.

chatbot clip gpt-4 llama llava vicuna vision-language vision-language-pretraining

Python

1441

120

2 months ago

Sense-GVT / DeCLIP

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

big-model clip multi-model self-supervised vision-language-pretraining zero-shot

Python

666

3 years ago

TXH-mercury / VALOR

[TPAMI2024] Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

vision-language-pretraining

Python

303

9 months ago

mbzuai-oryx / VideoGPT-plus

Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

chatbot clip gpt4 gpt4o llama3 llava multimodal vicuna vision-language vision-language-pretraining

Python

286

2 months ago

sail-sg / ptp

[CVPR2023] The code for "Position-guided Text Prompt for Vision-Language Pre-training"

cross-modality vision-language-pretraining

Python

152

2 years ago

jusiro / FLAIR

[MedIA'25] FLAIR: A Foundation LAnguage-Image model of the Retina for fundus image understanding.

foundation-models Medical imaging vision-language-pretraining

Python

146

3 months ago

Fr0zenCrane / UniCoT

Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision

cot multimodal unified-model any-to-any large-language-model artificial-intelligence deep-learning vision-language-pretraining chain-of-thought machine-vision

Python

146

10 days ago

BridgeVLA / BridgeVLA

✨✨ [NeurIPS 2025] Official implementation of BridgeVLA

embodied-ai Robotics vision-language-pretraining

Python

137

14 days ago

Surrey-UP-Lab / RegionSpot

Recognize Any Regions

auto-labeling instance-segmentation object-detection open-world vision-language-model vision-language-pretraining zero-shot

Python

121

10 months ago

vgthengane / Continual-CLIP

Official repository for "CLIP model is an Efficient Continual Learner".

clip continual-learning vision-language-pretraining foundational-models baseline

Python

101

3 years ago

ArrowLuo / SegCLIP

PyTorch implementation of ICML 2023 paper "SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation"

semantic-segmentation transfer-learning vision-language-pretraining contrastive-learning

Python

2 years ago

HieuPhan33 / CVPR2024_MAVL

Multi-Aspect Vision Language Pretraining - CVPR2024

vision-language-model vision-language-pretraining zero-shot-classification zero-shot-segmentation

Python

1 year ago

marslanm / Multimodality-Representation-Learning

This repository provides a comprehensive collection of research papers on multimodal representation learning, all of which are cited and discussed in the recently accepted survey (https://dl.acm.org/doi/abs/10.1145/3617833).

cross-modal multimodal-datasets multimodal-deep-learning multimodal-pre-trained-model transformer-models vision-language-pretraining

4 months ago