#vlm

huggingface/transformers

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Python
148516
25 minutes ago

The Open-sourced Multimodal AI Agent Stack connecting Cutting-edge AI Models and Agent Infra.

TypeScript
17634
3 hours ago
Python
17021
9 minutes ago

A collection of tutorials on state-of-the-art computer vision models and techniques. Explore everything from foundational architectures like ResNet to cutting-edge models like YOLO11, RT-DETR, SAM 2, Florence-2, PaliGemma 2, and Qwen2.5VL.

Jupyter Notebook
8220
5 days ago

The official repository for ERNIE 4.5 and ERNIEKit – its industrial-grade development toolkit based on PaddlePaddle.

Python
7410
11 hours ago

Solve Visual Understanding with Reinforced VLMs

Python
5474
2 months ago

On device AI inference in minutes—now for MLX & GGUF, with Android, iOS and NPU backends coming soon.

Go
4681
7 days ago

StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.

Python
3986
4 months ago

The official repo of MiniMax-Text-01 and MiniMax-VL-01, large-language-model & vision-language-model based on Linear Attention

Python
3122
1 month ago

Skywork-R1V is an advanced multimodal AI model series developed by Skywork AI (Kunlun Inc.), specializing in vision-language reasoning.

Python
2941
17 days ago

An AI-powered file management tool that ensures privacy by organizing local texts and images. Using Llama3.2 3B and Llava v1.6 models with the Nexa SDK, it intuitively scans, restructures, and organizes files for quick, seamless access and easy retrieval.

Python
2490
10 months ago

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents to ace any computer task by enabling strong reasoning abilities, self-improvement, and skill curation, in a standardized general environment with minimal requirements.

Python
2250
9 months ago
Python
2083
3 hours ago

LLM Agent Framework in ComfyUI includes MCP server, Omost, GPT-sovits, ChatTTS, GOT-OCR2.0, and FLUX prompt nodes, access to Feishu, discord, and adapts to all llms with similar openai / aisuite interfaces, such as o1, ollama, gemini, grok, qwen, GLM, deepseek, kimi, doubao. Adapted to local llms, vlm, gguf such as llama-3.3 Janus-Pro, Linkage graphRAG

Python
1859
3 days ago

A reading list for large models safety, security, and privacy (including Awesome LLM Security, Safety, etc.).

1608
2 days ago

🚀🚀🚀 A collection of some awesome public YOLO object detection series projects and the related object detection datasets.

1558
3 months ago

A streamlined and customizable framework for efficient large model evaluation and performance benchmarking

Python
1512
7 hours ago

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Python
1470
2 days ago