multimodal-datasets
LAVIS - A One-stop Library for Language-Vision Intelligence
Compose multimodal datasets 🎹
This repository is built in association with our position paper, "Multimodality for NLP-Centered Applications: Resources, Advances and Frontiers". As part of this release, we share information about recent multimodal datasets that are available for research purposes. We found that although 100+ multimodal language resources are available in the literature for various NLP tasks, publicly available multimodal datasets remain under-explored for reuse in subsequent problem domains.
PyTorch implementation of Multimodal Fusion Transformer for Remote Sensing Image Classification.
[NeurIPS 2023 Oral] Quilt-1M: One Million Image-Text Pairs for Histopathology.
A dataset of 500,000 multimodal short videos, with baseline models (TensorFlow 2.0).
Code from the paper "Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models"
This repository provides a comprehensive collection of research papers on multimodal representation learning, all of which are cited and discussed in the recently accepted survey: https://dl.acm.org/doi/abs/10.1145/3617833
Code and data to evaluate LLMs on ENEM, the main standardized Brazilian university admission exam.
[Paperlist] Awesome paper list of multimodal dialog, including methods, datasets and metrics
Real-world photo sequence question answering system (MemexQA). CVPR'18 and TPAMI'19
Million-Scale Face/Human-Scene Image-Text Datasets
Collects a multimodal dataset of Wikipedia articles and their images
[ICCV 2025] Official repository of "Mitigating Object Hallucinations via Sentence-Level Early Intervention".
Vision-Language Models Toolbox: Your all-in-one solution for multimodal research and experimentation
Data and code of the Findings of EMNLP'23 paper MuG: A Multimodal Classification Benchmark on Game Data with Tabular, Textual, and Visual Fields
Towards Explainable Multimodal Depression Recognition for Clinical Interviews
Official Git repository for "Hakimov, S., and Schlangen, D., (2023). Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks. Findings of the Association for Computational Linguistics (ACL 2023 Findings)"
Pre-Processing of Annotated Music Video Corpora (COGNIMUSE and DEAP)
Image Recommendation for Wikipedia Articles