qwen2-vl

A community-driven AI automation framework that builds upon the incredible work of the open source community. Our goal is to combine language models with specialized tools for tasks like web search, crawling, and Python code execution, while giving back to the community that made this possible.

agi automation deep-research langchain langgraph large-language-model qwen qwen2-vl agent agents artificial-intelligence multi-agent multi-agent-systems deepseek deepseek-r1

Python

5128

543

6 months ago

roboflow / maestro

Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL

captioning fine-tuning florence-2 multimodal objectdetection paligemma phi-3-vision transformers vision-and-language vqa qwen2-vl

Python

2631

217

5 days ago

2U1 / Qwen2-VL-Finetune

An open-source implementation for fine-tuning the Qwen2-VL and Qwen2.5-VL series by Alibaba Cloud.

chatbot multimodal qwen2-vl vision-language vision-language-model qwen2-5

Python

1223

155

3 days ago

PaddlePaddle / PaddleMIX

Paddle Multimodal Integration and eXploration, supporting mainstream multi-modal tasks, including end-to-end large-scale multi-modal pretrain models and diffusion model toolbox. Equipped with high performance and flexibility.

aigc stable-diffusion clip image-to-text text-to-image controlnet multimodal text-to-video dit llava sora qwen2-vl minicpm-v

Python

700

221

1 month ago

lucasjinreal / Crane

A pure Rust-based LLM inference engine (covering any LLM-based MLLM, such as Spark-TTS), powered by the Candle framework.

llama-cpp mllm qwen2-vl Rust qwen3

Rust

163

11 days ago

NetEase-Media / grps_trtllm

A higher-performance OpenAI LLM service than vLLM serve: a pure C++ high-performance OpenAI LLM service implemented with GRPS + TensorRT-LLM + Tokenizers.cpp, supporting chat and function calling, AI agents, distributed multi-GPU inference, multimodal capabilities, and a Gradio chat interface.

large-language-model openai tensorrt-llm chatglm llama3 qwen2 function-call ai-agent llama-index multi-modal deepseek-r1 phi qwq qwen2-vl minicpm-v internvl qwen3

Python

155

5 months ago

worldbench / drivebench

[ICCV 2025] Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

autonomous-driving ChatGPT internvl qwen2-vl

Python

114

2 months ago

arcstep / illufly

✨🦋 illufly (幻蝶, "Phantom Butterfly"): a self-evolving agent based on memory distillation and document retrieval

agent artificial-intelligence glm-4 gpt large-language-model multiagent openai qwen qwen2 qwen2-vl rag growth

Python

4 months ago

col14m / cadrille

cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning

cad large-language-model PyTorch qwen2-vl vlm

Python

16 days ago

soulteary / dify-with-qwen-vl

Video understanding: the Qwen multimodal video model & Dify

dify qwen2 qwen2-vl

Python

1 year ago

fireicewolf / wd-llm-caption-cli

A Python-based CLI tool for captioning images with WD series, Joy-caption-pre-alpha, Meta Llama 3.2 Vision Instruct, and Qwen2 VL Instruct models.

qwen2-vl florence-2

Python

10 days ago

Younis-Ahmed / qwen-ai-provider

Community-built Qwen AI Provider for Vercel AI SDK - Integrate Alibaba Cloud's Qwen models with Vercel's AI application framework

artificial-intelligence vercel-ai vercel-ai-sdk qwen qwen2-5 qwen2-vl generative-ai Vercel alibaba-cloud language-model

TypeScript

6 days ago

see2023 / autoXHS

An intelligent search assistant based on multimodal large models, enabling smart information retrieval and knowledge integration on the Xiaohongshu platform.

large-language-model qwen2-vl Selenium xiaohongshu spider

Python

1 year ago

shaadclt / Qwen2-VL-OCR-VQA

This project demonstrates how to use the Qwen2-VL model from Hugging Face for Optical Character Recognition (OCR) and Visual Question Answering (VQA). The model combines vision and language capabilities, enabling users to analyze images and generate context-based responses.

optical-character-recognition qwen2-vl visual-question-answering

Jupyter Notebook

1 year ago

BUAADreamer / Qwen2-VL-History

A LLaMA-Factory fine-tuning case study for Qwen2-VL in the culture and tourism domain (historical literature and museums)

history mllm multimodal-large-language-models qwen2-vl

1 year ago

zhangguanghao523 / CMMCoT

Official implementation of CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

chain-of-thought cot mllm qwen2-vl

Python

5 months ago

aws-samples / sample-for-multi-modal-document-to-json-with-sagemaker-ai

This open-source project delivers a complete pipeline for converting multi-page documents (PDFs/images) into structured JSON using Vision LLMs on Amazon SageMaker. The solution leverages the SWIFT Framework to fine-tune models specifically for document understanding tasks.

Amazon Web Services document-processing fine-tuning huggingface idp llama multimodal qwen2-vl sagemaker sft Swift

Jupyter Notebook

2 months ago