image-text-retrieval

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.

image-classification image-text-retrieval large-language-model semantic-segmentation video-classification vision-language-model vit-22b vit-6b multi-modal gpt gpt-4v gpt-4o

Python

9285

720

13 days ago

OFA-Sys / Chinese-CLIP

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.

chinese computer-vision multi-modal-learning natural-language-processing PyTorch vision-and-language-pre-training image-text-retrieval clip pretrained-models vision-language deep-learning multi-modal contrastive-loss transformers coreml-models

Jupyter Notebook

5546

521

1 month ago

salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

vision-language vision-and-language-pre-training image-text-retrieval image-captioning visual-question-answering vision-language-transformer

Jupyter Notebook

5509

717

1 year ago

slavabarkov / tidy

Offline semantic Text-to-Image and Image-to-Image search on Android powered by quantized state-of-the-art vision-language pretrained CLIP model and ONNX Runtime inference engine

Android clip computer-vision deep-learning image-retrieval Kotlin natural-language-processing onnx quantization image-text-retrieval cross-modal-retrieval image-text-matching image-search semantic-search

Kotlin

482

2 years ago

greyovo / PicQuery

🔍 Search local images with natural language on Android, powered by OpenAI's CLIP model.

Android clip image-text-retrieval material-design-3 openai Jetpack Compose

Kotlin

441

3 months ago

Paranioar / Awesome_Matching_Pretraining_Transfering

The Paper List of Large Multi-Modality Model (Perception, Generation, Unification), Parameter-Efficient Finetuning, Vision-Language Pretraining, Conventional Image-Text Matching for Preliminary Insight.

cross-modal-retrieval tutorial Awesome Lists image-text-matching image-text-retrieval large-language-models large-vision-language-models multimodal-pretraining parameter-efficient-fine-tuning vision-and-language multimodal-large-language-models large-language-model text-to-image-generation text-to-image-synthesis text-to-video-generation

430

9 days ago

Paranioar / SGRAF

[AAAI2021] The code of “Similarity Reasoning and Filtration for Image-Text Matching”

cross-modal-retrieval image-text-matching image-retrieval image-text-retrieval text-matching aaai

Python

219

1 year ago

chuhaojin / Text2Poster-ICASSP-22

Official implementation of the ICASSP-2022 paper "Text2Poster: Laying Out Stylized Texts on Retrieved Images"

aigc deep-learning multimodal-generation image-processing image-retrieval artificial-neural-networks PyTorch object-detection image-text-retrieval

Python

213

2 years ago

alipay / Ant-Multi-Modal-Framework

Research Code for Multimodal-Cognition Team in Ant Group

image-text-retrieval multimodal-learning video-editing

Python

167

3 months ago

howard-hou / BagFormer

PyTorch code for BagFormer: Better Cross-Modal Retrieval via bag-wise interaction

cross-modal-retrieval image-text-retrieval vision-language

Python

3 years ago

X-PLUG / mPLUG

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. (EMNLP 2022)

image-captioning image-text-retrieval multimodal pretraining PyTorch transformer vqa

Python

2 years ago

hpc203 / Chinese-CLIP-opencv-onnxrun

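Deploys Chinese CLIP with OpenCV and onnxruntime for text-to-image search: given a sentence describing the desired image, it retrieves matching images from a gallery. Includes both C++ and Python versions.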

clip image-text-retrieval opencv-dnn multimodal-large-language-models

C++

2 years ago

MILVLG / rosita

ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

vision-and-language vqa pre-training image-text-retrieval referring-expression-comprehension

Python

2 years ago

cobanov / image-captioning

Image captioning using python and BLIP

image-captioning blip image-text-retrieval vision-language

Python

2 years ago

eric-ai-lab / ComCLIP

Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching"

blip2 causality clip compositionality image-text-matching image-text-retrieval vision-and-language

Python

1 year ago

eric-ai-lab / CPL

Official implementation of our EMNLP 2022 paper "CPL: Counterfactual Prompt Learning for Vision and Language Models"

causal-inference image-classification image-text-retrieval prompt-tuning vision-and-language vqa

Python

3 years ago

Paranioar / RCAR

[TIP2023] The code of “Plug-and-Play Regulators for Image-Text Matching”

cross-modal-retrieval image-text-matching image-retrieval image-text-retrieval text-matching tip

Python

1 year ago

ytaek-oh / fsc-clip

[EMNLP 2024] Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

image-text-retrieval zero-shot-classification compositionality

Python

1 year ago

alipay / PC2-NoiseofWeb

Noise of Web (NoW) is a challenging noisy correspondence learning (NCL) benchmark containing 100K image-text pairs for robust image-text matching/retrieval models.

benchmark cross-modal-retrieval dataset image-text-matching image-text-retrieval multimodal-learning

Python

2 months ago

frank-chris / ImageTextRetrieval

In this work, we implement different cross-modal learning schemes such as Siamese Network, Correlational Network and Deep Cross-Modal Projection Learning model and study their performance. We also propose a modified Deep Cross-Modal Projection Learning model that uses a different image feature extractor. We evaluate the model’s performance on image-text retrieval on a fashion clothing dataset.

image-text-retrieval cross-modal-retrieval cross-modal-learning PyTorch Tensorflow Flask

Jupyter Notebook

4 years ago