
llm-serving
Python · 55,781 stars · updated 39 minutes ago

This project shares technical principles and hands-on experience with large language models (LLM engineering and bringing LLM applications to production).
HTML · 20,209 stars · updated 17 days ago
Python · 17,026 stars · updated 2 hours ago

Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI-compatible API endpoints in the cloud.
Python · 11,700 stars · updated 2 days ago
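"OpenAI-compatible" means the self-hosted endpoint accepts the same request shape as OpenAI's `/v1/chat/completions`, so existing clients only need to swap the base URL. A minimal sketch of that request body (the model name below is a hypothetical placeholder, not one of these projects' APIs):

```python
import json

def chat_request(model: str, user_message: str) -> dict:
    """Build the JSON body an OpenAI-compatible /v1/chat/completions
    endpoint expects from a client."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

# "my-local-deepseek" is a placeholder for whatever model the server loaded.
body = chat_request("my-local-deepseek", "Hello!")
print(json.dumps(body))
```

Because the wire format matches, the same payload works against api.openai.com or a local server; only the host and credentials differ.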

TensorRT-LLM provides an easy-to-use Python API for defining Large Language Models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also includes components for building Python and C++ runtimes that orchestrate inference execution in a performant way.
C++ · 11,370 stars · updated 9 hours ago
Python · 8,533 stars · updated 6 hours ago
bentoml/BentoML
The easiest way to serve AI apps and models: build model inference APIs, job queues, LLM apps, multi-model pipelines, and more!
Python · 7,996 stars · updated 2 days ago

High-performance inference and deployment toolkit for LLMs and VLMs, based on PaddlePaddle.
Python · 3,455 stars · updated 19 hours ago

Multi-LoRA inference server that scales to thousands of fine-tuned LLMs.
Python · 3,388 stars · updated 3 months ago
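Multi-LoRA serving scales because the large base weights are loaded once and shared, while each fine-tune contributes only a tiny low-rank delta selected per request. A minimal sketch of the idea (adapter names, shapes, and functions here are illustrative, not this server's actual API):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, x):
    return [dot(row, x) for row in M]

# Shared base weight, loaded once and reused by every request.
W = [[1.0, 0.0],
     [0.0, 1.0]]

# Per-tenant rank-1 LoRA adapters: the weight delta is the outer product B A,
# so each fine-tune stores only 2*d numbers instead of a full d*d matrix.
adapters = {
    "tenant-a": ([0.5, -0.5], [1.0, 2.0]),   # (B column, A row)
    "tenant-b": ([0.0, 1.0], [0.3, 0.3]),
}

def forward(x, adapter_id):
    """One linear layer with a request-selected LoRA delta: y = (W + B A) x."""
    B, A = adapters[adapter_id]
    base = matvec(W, x)
    scale = dot(A, x)                         # A x (a scalar at rank 1)
    return [b0 + bi * scale for b0, bi in zip(base, B)]

print(forward([1.0, 1.0], "tenant-a"))       # → [2.5, -0.5]
```

Routing a request is then just a dictionary lookup on the adapter id; the expensive base computation is identical for every tenant.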

MoBA: Mixture of Block Attention for Long-Context LLMs
Python · 1,864 stars · updated 5 months ago

RayLLM - LLMs on Ray (archived; see the README for more info).
Python · 1,263 stars · updated 5 months ago

High-performance inference framework for large language models, focused on efficiency, flexibility, and availability.
Python · 1,239 stars · updated 2 days ago

Community-maintained hardware plugin for vLLM on Ascend.
Python · 1,016 stars · updated 1 day ago

A highly optimized LLM inference acceleration engine for Llama and its variants.
C++ · 900 stars · updated 1 month ago

A throughput-oriented, high-performance serving framework for LLMs.
Jupyter Notebook · 872 stars · updated 8 days ago

A high-performance ML model serving framework offering dynamic batching and CPU/GPU pipelines to fully exploit your compute resources.
Python · 852 stars · updated 19 days ago
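Dynamic batching groups requests that arrive close together into one model call, trading a small wait for much higher throughput. A minimal sketch of the collection loop (function and parameter names are illustrative, not this framework's API):

```python
import queue
import time

def collect_batch(requests: "queue.Queue", max_batch_size: int = 8,
                  max_wait_s: float = 0.01) -> list:
    """Pull requests into one batch: flush when the batch is full or when
    max_wait_s has elapsed since the first request was picked up."""
    batch = [requests.get()]                 # block until work exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                            # waited long enough; flush
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                            # queue drained before deadline
    return batch

q = queue.Queue()
for i in range(10):
    q.put(i)
print(collect_batch(q))                      # → [0, 1, 2, 3, 4, 5, 6, 7]
print(collect_batch(q))                      # → [8, 9]
```

Tuning `max_batch_size` against `max_wait_s` is the core throughput/latency trade-off: larger batches amortize the per-call cost, while a shorter wait bounds the latency added to the first request in the batch.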