# inference-optimization

BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.

C++ · 886 stars · updated 8 months ago

The Tensor Algebra SuperOptimizer for Deep Learning

C++ · 730 stars · updated 3 years ago

Everything you need to know about LLM inference

TypeScript · 217 stars · updated 2 days ago

[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration

C++ · 201 stars · updated 3 years ago

Batch normalization fusion for PyTorch. This repository is archived and no longer maintained.

Python · 197 stars · updated 5 years ago
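Folding batch normalization into the preceding convolution is a standard inference optimization: the BN layer's per-channel scale and shift are absorbed into the conv weights and bias, so inference runs one op instead of two. A minimal NumPy sketch of the folding arithmetic (illustrative, not this repository's API):

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into the preceding convolution.

    w: conv weight of shape (out_ch, in_ch, kh, kw); b: conv bias (out_ch,).
    The BN computes gamma * (conv(x) - mean) / sqrt(var + eps) + beta,
    which collapses into a rescaled weight and a shifted bias.
    """
    scale = gamma / np.sqrt(var + eps)         # per-output-channel scale
    w_fused = w * scale[:, None, None, None]   # rescale each output filter
    b_fused = (b - mean) * scale + beta        # fold mean/shift into the bias
    return w_fused, b_fused
```

The fused layer is numerically identical to conv followed by BN (up to floating-point rounding), which is why the fusion is safe at inference time but not during training, where the BN statistics still change.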

Optimize layers structure of Keras model to reduce computation time

Python · 157 stars · updated 5 years ago

A set of tools to make your life easier with TensorRT and ONNX Runtime. This repo is designed for YOLOv3.

Python · 80 stars · updated 6 years ago

Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)

Python · 64 stars · updated 5 months ago
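Global pruning chooses which weights to drop by comparing magnitudes across all layers at once rather than layer by layer, so low-magnitude layers lose proportionally more weights. SparseLLM's actual method is an optimization-based formulation; the sketch below shows only the simpler global-threshold idea for contrast (function name is illustrative):

```python
import numpy as np

def global_magnitude_prune(layers, sparsity=0.5):
    """Zero out the globally smallest-magnitude weights across all layers.

    Unlike per-layer pruning, a single threshold is computed over the
    concatenated magnitudes, so sparsity is allocated unevenly per layer.
    """
    all_mags = np.concatenate([np.abs(w).ravel() for w in layers])
    threshold = np.quantile(all_mags, sparsity)   # global cutoff
    return [np.where(np.abs(w) >= threshold, w, 0.0) for w in layers]
```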

[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

Python · 41 stars · updated 3 months ago
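Diversity-based token pruning can be sketched as greedy max-min selection: repeatedly keep the token embedding farthest from everything already selected, so the retained subset covers the embedding space. This is a generic illustration of the idea, not the paper's implementation:

```python
import numpy as np

def diverse_token_subset(tokens, k):
    """Greedy max-min selection over token embeddings.

    tokens: (N, d) embeddings. Returns indices of k tokens, each chosen to
    maximize its minimum distance to the tokens already selected.
    """
    dists = np.linalg.norm(tokens[:, None] - tokens[None, :], axis=-1)
    selected = [0]                    # seed with the first token
    min_dist = dists[0].copy()        # distance of each token to the selected set
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dists[nxt])
    return selected
```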

Blog posts, reading reports, and code examples covering AGI/LLM-related knowledge.

Python · 40 stars · updated 7 months ago

Learn the ins and outs of efficiently serving Large Language Models (LLMs). Dive into optimization techniques, including KV caching and Low Rank Adapters (LoRA), and gain hands-on experience with Predibase’s LoRAX framework inference server.

Jupyter Notebook · 17 stars · updated 1 year ago
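KV caching, mentioned above, stores each generated token's key and value vectors so that later decoding steps attend over the cached history instead of recomputing it for the whole prefix. A single-head NumPy sketch (hypothetical class, not LoRAX's API):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Append-only cache of past keys/values for one attention head."""
    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def attend(self, q, k, v):
        # Store the new token's key/value, then attend over all cached ones.
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])
        scores = self.keys @ q / np.sqrt(q.shape[-1])
        return softmax(scores) @ self.values
```

Per step, this turns an O(T^2) recomputation into an O(T) lookup, at the cost of the memory held by the cache, which is exactly why cache size dominates LLM serving capacity.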

Cross-platform, modular neural network inference library; small and efficient.

C++ · 13 stars · updated 2 years ago

A template for getting started writing code using GGML

C++ · 10 stars · updated 1 year ago

Accelerating LLM inference with techniques like speculative decoding, quantization, and kernel fusion, focusing on implementing state-of-the-art research papers.

Python · 10 stars · updated 2 months ago
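Speculative decoding lets a cheap draft model propose several tokens that the target model then verifies, accepting the longest agreeing prefix, so the output is identical to decoding with the target alone. A greedy-decoding sketch with toy models as plain functions (real systems verify the whole draft in one parallel target pass):

```python
def speculative_decode(target, draft, prefix, k=4, steps=8):
    """Greedy speculative decoding sketch.

    `target` and `draft` map a token list to the next token (argmax decoding).
    The draft proposes k tokens; the target checks them and the longest
    agreeing prefix is accepted, plus one corrected token on a mismatch.
    """
    out = list(prefix)
    while len(out) - len(prefix) < steps:
        # Draft model speculates k tokens autoregressively (cheap).
        spec, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            spec.append(t)
            ctx.append(t)
        # Target model verifies each speculated position.
        for t in spec:
            expected = target(out)
            out.append(expected)       # always emit the target's token
            if expected != t:          # mismatch: discard the rest of the draft
                break
    return out[:len(prefix) + steps]
```

Because every emitted token comes from the target, the speedup is pure: a good draft just lets several target tokens be verified per pass instead of generated one at a time.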

Faster inference YOLOv8: optimize and export YOLOv8 models for faster inference using OpenVINO and NumPy 🔢

Python · 10 stars · updated 8 months ago

LLM-Rank: a graph-theoretical approach to structured pruning of large language models based on weighted PageRank centrality, as introduced in the accompanying paper.

Python · 6 stars · updated 9 months ago
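PageRank centrality on a weighted graph can be computed by power iteration, and a centrality-based pruner then drops the least central nodes. A generic sketch of that idea, not the paper's code:

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank on a weighted adjacency matrix.

    adj[i, j] is the edge weight from node i to node j; rows are normalized
    into transition probabilities before iterating.
    """
    n = adj.shape[0]
    row_sums = adj.sum(axis=1, keepdims=True)
    # Dangling rows (no outgoing edges) fall back to a uniform transition.
    P = np.divide(adj, row_sums, out=np.full_like(adj, 1.0 / n),
                  where=row_sums > 0)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (r @ P)
    return r

def prune_least_central(weights, keep_ratio=0.5):
    """Return indices of the most central fraction of nodes."""
    scores = pagerank(weights)
    k = max(1, int(len(scores) * keep_ratio))
    return np.argsort(scores)[::-1][:k]
```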

Dynamic Attention Mask (DAM) generates adaptive sparse attention masks per layer and head for Transformer models, enabling long-context inference with lower compute and memory overhead, without fine-tuning.

Python · 6 stars · updated 2 months ago
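One way to sketch an adaptive sparse mask: from one head's raw attention logits, keep a local window plus the top-k strongest earlier positions per query, yielding a sparse causal mask. An illustrative NumPy sketch with hypothetical parameters, not DAM's actual algorithm:

```python
import numpy as np

def adaptive_sparse_mask(scores, top_k=4, window=2):
    """Build a boolean causal attention mask that keeps, for each query,
    its local window plus the top_k highest-scoring earlier positions.

    scores: (T, T) raw attention logits for one head.
    """
    T = scores.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))
    mask = np.zeros((T, T), dtype=bool)
    for q in range(T):
        allowed = np.where(causal[q])[0]
        # Always keep a local window of recent tokens (including q itself).
        local = allowed[allowed >= q - window]
        # Add the strongest allowed positions by score.
        ranked = allowed[np.argsort(scores[q, allowed])[::-1]]
        keep = set(local.tolist()) | set(ranked[:top_k].tolist())
        mask[q, list(keep)] = True
    return mask
```

Each row keeps at most window + 1 + top_k entries, so attention cost per query is bounded regardless of context length, which is the point of sparse long-context inference.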

Your AI Catalyst: inference backend to maximize your model's inference performance

C++ · 5 stars · updated 8 months ago

A constrained expectation-maximization algorithm for feasible graph inference.

Jupyter Notebook · 4 stars · updated 4 years ago