vision-language

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"

object-detection open-world open-world-detection vision-language vision-language-transformer

Python

8995

920

1 year ago

OFA-Sys / Chinese-CLIP

Chinese version of CLIP which achieves Chinese cross-modal retrieval and representation generation.

chinese computer-vision multi-modal-learning nlp PyTorch vision-and-language-pre-training image-text-retrieval clip pretrained-models vision-language deep-learning multi-modal contrastive-loss transformers coreml-models

Jupyter Notebook

5546

521

1 month ago

salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

vision-language vision-and-language-pre-training image-text-retrieval image-captioning visual-question-answering vision-language-transformer

Jupyter Notebook

5509

717

1 year ago

marqo-ai / marqo

Unified embedding generation and search engine. Also available on cloud - cloud.marqo.ai

deep-learning information-retrieval machine-learning vector-search tensor-search clip multi-modal search-engine transformers vision-language semantic-search visual-search nlp hnsw knn Hacktoberfest ChatGPT gpt large-language-models

Python

4971

216

3 days ago

OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

multimodal pretraining image-captioning text-to-image-synthesis visual-question-answering referring-expression-comprehension vision-language pretrained-models prompt prompt-tuning chinese

Python

2536

248

1 year ago

AlibabaResearch / AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

C++

1780

200

6 months ago

mbzuai-oryx / Video-ChatGPT

[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.

chatbot clip gpt-4 llama llava vicuna vision-language vision-language-pretraining

Python

1441

120

2 months ago

llm-jp / awesome-japanese-llm

Overview of Japanese LLMs

language-model language-models llm large-language-models japanese japanese-language vision-and-language foundation-models multimodal vision-language vision-language-model generative-ai generative-model generative-models

TypeScript

1234

21 days ago

2U1 / Qwen2-VL-Finetune

An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud.

chatbot multimodal qwen2-vl vision-language vision-language-model qwen2-5

Python

1223

155

3 days ago

OpenDriveLab / DriveLM

[ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering

autonomous-driving large-language-models vision-language chain-of-thought graph-of-thoughts llm prompting tree-of-thoughts prompt-engineering

HTML

1171

3 months ago

OFA-Sys / ONE-PEACE

A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

foundation-models multimodal representation-learning vision-language audio-language vision-and-language vision-transformer contrastive-loss

Python

1050

1 year ago

google-research / pix2seq

Pix2Seq codebase: multi-tasks with generative modeling (autoregressive and diffusion)

object-detection computer-vision vision-language deep-learning tensorflow2

Jupyter Notebook

929

2 years ago

TinyLLaVA / TinyLLaVA_Factory

A Framework of Small-scale Large Multimodal Models

large-multimodal-models llama llava nlp transformers vision-language

Python

904

5 months ago

SunzeY / AlphaCLIP

[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

deep-learning machine-learning vision-language vision-language-model vision-transformer vision-and-language

Jupyter Notebook

844

2 months ago

mbzuai-oryx / LLaVA-pp

🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)

conversation llama3 llava llm lmms phi3 vision-language llama-3-llava llama-3-vision llama3-llava phi-3-vision phi3-vision

Python

840

2 months ago

Algolzw / daclip-uir

[ICLR 2024] Controlling Vision-Language Models for Universal Image Restoration. 5th place in the NTIRE 2024 Restore Any Image Model in the Wild Challenge.

diffusion-models image-restoration prompt vision-language image-deblurring image-denoising image-deraining low-level-vision PyTorch deep-learning

Python

787

1 year ago