multimodal-models
A curated list of foundation models for vision and language tasks
Awesome Unified Multimodal Models
🔥🔥🔥 A curated list of papers on LLMs-based multimodal generation (image, video, 3D and audio).
A frontier collection and survey of vision-language model papers and models, hosted as a GitHub repository. Continuously updated.
Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development
A curated list of Awesome Personalized Large Multimodal Models resources
Implementation of the paper "Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning", arXiv, 2025
Multimodal Bi-Transformers (MMBT) in Biomedical Text/Image Classification
NanoOWL Detection System enables real-time open-vocabulary object detection in ROS 2 using a TensorRT-optimized OWL-ViT model. Describe objects in natural language and detect them instantly on panoramic images. Optimized for NVIDIA GPUs with serialized TensorRT `.engine` acceleration.
Test of the Phi-3-Vision model, running locally.
Leverage VideoLLaMA 3's capabilities using LitServe.
Leverage Gemma 3's capabilities using LitServe.