
llm-serving
Python · 55,781 stars · updated 39 minutes ago

This project shares technical principles and hands-on experience with large language models (LLM engineering and bringing LLM applications to production).
HTML · 20,209 stars · updated 17 days ago
Python · 17,026 stars · updated 2 hours ago

Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI-compatible API endpoints in the cloud.
Python · 11,700 stars · updated 2 days ago
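"OpenAI-compatible" means the self-hosted endpoint accepts the same request shape as OpenAI's `/v1/chat/completions`, so existing clients only need to swap the base URL. A minimal sketch of that request body (the model name below is a hypothetical placeholder, not one of these projects' APIs):

```python
import json

def chat_request(model: str, user_message: str) -> dict:
    """Build the JSON body an OpenAI-compatible /v1/chat/completions
    endpoint expects from a client."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

# "my-local-deepseek" is a placeholder for whatever model the server loaded.
body = chat_request("my-local-deepseek", "Hello!")
print(json.dumps(body))
```

Because the wire format matches, the same payload works against api.openai.com or a local server; only the host and credentials differ.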

TensorRT-LLM provides an easy-to-use Python API for defining Large Language Models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also includes components for building Python and C++ runtimes that orchestrate inference execution in a performant way.
C++ · 11,370 stars · updated 9 hours ago
Python · 8,533 stars · updated 6 hours ago
bentoml/BentoML
The easiest way to serve AI apps and models: build model inference APIs, job queues, LLM apps, multi-model pipelines, and more!
Python · 7,996 stars · updated 2 days ago

High-performance inference and deployment toolkit for LLMs and VLMs, based on PaddlePaddle.
Python · 3,455 stars · updated 19 hours ago

Multi-LoRA inference server that scales to thousands of fine-tuned LLMs.
Python · 3,388 stars · updated 3 months ago
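Multi-LoRA serving scales because the large base weights are loaded once and shared, while each fine-tune contributes only a tiny low-rank delta selected per request. A minimal sketch of the idea (adapter names, shapes, and functions here are illustrative, not this server's actual API):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, x):
    return [dot(row, x) for row in M]

# Shared base weight, loaded once and reused by every request.
W = [[1.0, 0.0],
     [0.0, 1.0]]

# Per-tenant rank-1 LoRA adapters: the weight delta is the outer product B A,
# so each fine-tune stores only 2*d numbers instead of a full d*d matrix.
adapters = {
    "tenant-a": ([0.5, -0.5], [1.0, 2.0]),   # (B column, A row)
    "tenant-b": ([0.0, 1.0], [0.3, 0.3]),
}

def forward(x, adapter_id):
    """One linear layer with a request-selected LoRA delta: y = (W + B A) x."""
    B, A = adapters[adapter_id]
    base = matvec(W, x)
    scale = dot(A, x)                         # A x (a scalar at rank 1)
    return [b0 + bi * scale for b0, bi in zip(base, B)]

print(forward([1.0, 1.0], "tenant-a"))       # → [2.5, -0.5]
```

Routing a request is then just a dictionary lookup on the adapter id; the expensive base computation is identical for every tenant.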

MoBA: Mixture of Block Attention for Long-Context LLMs
Python · 1,864 stars · updated 5 months ago

RayLLM - LLMs on Ray (archived; see the README for more info).
Python · 1,263 stars · updated 5 months ago

High-performance inference framework for large language models, focused on efficiency, flexibility, and availability.
Python · 1,239 stars · updated 2 days ago

Community-maintained hardware plugin for vLLM on Ascend.
Python · 1,016 stars · updated 1 day ago

A highly optimized LLM inference acceleration engine for Llama and its variants.
C++ · 900 stars · updated 1 month ago

A throughput-oriented, high-performance serving framework for LLMs.
Jupyter Notebook · 872 stars · updated 8 days ago

A high-performance ML model serving framework offering dynamic batching and CPU/GPU pipelines to fully exploit your compute resources.
Python · 852 stars · updated 19 days ago
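Dynamic batching groups requests that arrive close together into one model call, trading a small wait for much higher throughput. A minimal sketch of the collection loop (function and parameter names are illustrative, not this framework's API):

```python
import queue
import time

def collect_batch(requests: "queue.Queue", max_batch_size: int = 8,
                  max_wait_s: float = 0.01) -> list:
    """Pull requests into one batch: flush when the batch is full or when
    max_wait_s has elapsed since the first request was picked up."""
    batch = [requests.get()]                 # block until work exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                            # waited long enough; flush
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                            # queue drained before deadline
    return batch

q = queue.Queue()
for i in range(10):
    q.put(i)
print(collect_batch(q))                      # → [0, 1, 2, 3, 4, 5, 6, 7]
print(collect_batch(q))                      # → [8, 9]
```

Tuning `max_batch_size` against `max_wait_s` is the core throughput/latency trade-off: larger batches amortize the per-call cost, while a shorter wait bounds the latency added to the first request in the batch.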