image-text-matching
Official PyTorch implementation of GroupViT: Semantic Segmentation Emerges from Text Supervision, CVPR 2022.
Offline semantic text-to-image and image-to-image search on Android, powered by a quantized, state-of-the-art pretrained vision-language CLIP model and the ONNX Runtime inference engine.
A paper list covering large multi-modality models (perception, generation, unification), parameter-efficient finetuning, vision-language pretraining, and conventional image-text matching, intended for preliminary insight.
[AAAI2021] The code of “Similarity Reasoning and Filtration for Image-Text Matching”
Code for "Learning the Best Pooling Strategy for Visual Semantic Embedding", CVPR 2021 (Oral)
Code for journal paper "Learning Dual Semantic Relations with Graph Attention for Image-Text Matching", TCSVT, 2020.
Extended COCO Validation (ECCV) Caption dataset (ECCV 2022)
Easy wrapper for inserting LoRA layers in CLIP.
Implementation of the "Learn No to Say Yes Better" paper.
Official implementation and dataset for the NAACL 2024 paper "ComCLIP: Training-Free Compositional Image and Text Matching"
A non-JIT implementation / replication of OpenAI's CLIP in PyTorch
[TIP2023] The code of “Plug-and-Play Regulators for Image-Text Matching”
Code implementation of paper "SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval".
[ICML 2024] Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models
Noise of Web (NoW) is a challenging noisy correspondence learning (NCL) benchmark containing 100K image-text pairs for robust image-text matching/retrieval models.
Text Query based Traffic Video Event Retrieval with Global-Local Fusion Embedding
A dead-simple image search / retrieval and image-text matching system for Bangla using CLIP
[TIP2024] The code of “Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching”
CLIP (Contrastive Language–Image Pre-training) for Bangla.
Unofficial code of paper "Improving description-based person re-identification by multi-granularity image-text alignment." by Niu et al. (partially implemented)