
vision-and-language

A one-stop repository for generative AI research updates, interview resources, notebooks, and much more.

14107 stars · updated 6 days ago

Streamlines the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL (a minimal LoRA sketch follows this entry).

Python · 2628 stars · updated 1 day ago
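
For orientation, here is a minimal sketch of what LoRA fine-tuning of such a multimodal model typically looks like with Hugging Face `transformers` and `peft`. The checkpoint id and target module names are assumptions for illustration, not this repo's documented recipe.

```python
# Hedged sketch: LoRA fine-tuning of a vision-language model with
# Hugging Face `transformers` + `peft`. Model id and target modules
# are assumptions; check the repo for its supported recipes.
from transformers import AutoModelForVision2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

# Wrap only the attention projections in low-rank adapters; the frozen
# base weights stay untouched, so the trainable footprint is tiny.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # sanity check before training
```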

Code for ALBEF: a new vision-language pre-training method (its contrastive-loss core is sketched below).

Python · 1692 stars · updated 3 years ago
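
ALBEF's pre-training rests on an image-text contrastive (ITC) objective, alongside image-text matching, masked language modeling, and momentum distillation, which this sketch omits. A minimal version of the symmetric ITC loss:

```python
# Minimal sketch of the symmetric image-text contrastive (ITC) loss
# that ALBEF-style pre-training builds on.
import torch
import torch.nn.functional as F

def itc_loss(image_feats, text_feats, temperature=0.07):
    # Normalize both embeddings so the dot product is cosine similarity.
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = itc_loss(torch.randn(8, 256), torch.randn(8, 256))
```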

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision" (see the sketch after this entry).

Python · 1488 stars · updated 1 year ago
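
ViLT's point is architectural: no CNN backbone or region detector, just a linear patch projection whose outputs are concatenated with word embeddings and fed to a single transformer. A toy sketch with illustrative dimensions:

```python
# Sketch of ViLT's core idea: image patches are linearly embedded and
# simply concatenated with word embeddings before one shared
# transformer encoder. Dimensions are illustrative.
import torch
import torch.nn as nn

dim, vocab = 768, 30522
patch_embed = nn.Conv2d(3, dim, kernel_size=32, stride=32)  # linear patch projection
word_embed = nn.Embedding(vocab, dim)
type_embed = nn.Embedding(2, dim)  # 0 = text, 1 = image
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True), num_layers=2)

image = torch.randn(1, 3, 384, 384)
tokens = torch.randint(0, vocab, (1, 40))

img_seq = patch_embed(image).flatten(2).transpose(1, 2)   # (1, 144, dim)
txt_seq = word_embed(tokens)                               # (1, 40, dim)
seq = torch.cat([txt_seq + type_embed.weight[0],
                 img_seq + type_embed.weight[1]], dim=1)
out = encoder(seq)  # one transformer handles both modalities jointly
```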

Real-time and accurate open-vocabulary end-to-end object detection (the open-vocabulary scoring step is sketched below).

Python · 1333 stars · updated 8 months ago
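
The open-vocabulary part usually comes down to matching region features against text embeddings of arbitrary class prompts, so the label set can change at inference time without retraining. A sketch with stubbed-out features; the real pipeline uses the detector's region features and a text encoder:

```python
# Sketch of the open-vocabulary scoring step behind detectors like
# this one. Feature extraction is stubbed with random tensors.
import torch
import torch.nn.functional as F

region_feats = torch.randn(100, 512)  # from the detection head (stub)
text_feats = torch.randn(3, 512)      # encoded prompts, e.g.
                                      # ["person", "red backpack", "dog"]
region = F.normalize(region_feats, dim=-1)
text = F.normalize(text_feats, dim=-1)
scores = region @ text.t()            # (100, 3) per-class similarity
labels = scores.argmax(dim=-1)        # open-vocabulary class assignment
```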

The implementation of "Prismer: A Vision-Language Model with Multi-Task Experts".

Python · 1309 stars · updated 2 years ago

Recent Advances in Vision and Language Pre-Trained Models (VL-PTMs)

1155 stars · updated 3 years ago

Codebase for Aria, an Open Multimodal Native MoE model.

Jupyter Notebook · 1068 stars · updated 7 months ago

A general representation model across vision, audio, and language modalities. Paper: "ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities".

Python · 1050 stars · updated 10 months ago

X-modaler is a versatile, high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).

Python · 968 stars · updated 2 years ago

[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-its-kind model capable of generating natural language responses that are seamlessly integrated with object segmentation masks.

Python · 906 stars · updated 15 days ago

[ECCV 2024 Best Paper Candidate & TPAMI 2025] PointLLM: Empowering Large Language Models to Understand Point Clouds (its projector interface is sketched below).

Python · 865 stars · updated 6 days ago
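
The PointLLM-style interface boils down to projecting point-cloud features into the LLM's token-embedding space and prepending them to the text embeddings as "point tokens". A stubbed sketch; dimensions and names are assumptions, and the real repo wires in an actual point encoder and language model:

```python
# Sketch of a PointLLM-style projector interface. Encoder and LLM are
# stubbed with random tensors; shapes are illustrative assumptions.
import torch
import torch.nn as nn

llm_dim = 4096
point_feats = torch.randn(1, 513, 384)     # stub point-encoder output
projector = nn.Linear(384, llm_dim)        # maps points into LLM space

point_tokens = projector(point_feats)      # (1, 513, llm_dim)
text_embeds = torch.randn(1, 32, llm_dim)  # embedded instruction tokens
inputs_embeds = torch.cat([point_tokens, text_embeds], dim=1)
# `inputs_embeds` would be fed to the LLM's forward pass instead of token ids.
```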

[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want (the alpha-channel idea is sketched below).

Jupyter Notebook · 836 stars · updated 1 month ago
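
Alpha-CLIP's core idea is an extra alpha-channel input that tells the visual encoder where to focus. Below is a sketch of one way to realize it: an auxiliary patch-embedding branch for the mask, added to the RGB patch embeddings. Shapes follow a ViT-B/16 at 224x224; the real repo fine-tunes a pretrained CLIP ViT rather than training this from scratch.

```python
# Sketch of the Alpha-CLIP idea: an alpha mask marks the region of
# interest, and an extra patch-embedding branch for the mask is added
# to the RGB patch embeddings so the encoder attends where the mask
# says. Illustrative, not the repo's exact implementation.
import torch
import torch.nn as nn

dim = 768
rgb_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
alpha_embed = nn.Conv2d(1, dim, kernel_size=16, stride=16)  # extra branch

image = torch.randn(1, 3, 224, 224)
alpha = torch.zeros(1, 1, 224, 224)
alpha[..., 64:160, 64:160] = 1.0                 # focus on a central region

patches = rgb_embed(image) + alpha_embed(alpha)  # (1, dim, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)      # (1, 196, dim) -> into the ViT
```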