int8
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
An innovative library for efficient LLM inference via low-bit quantization
Reimplementation of RetinaFace using C++ and TensorRT
A simple pipeline for INT8 quantization based on TensorRT.
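Such a pipeline usually amounts to parsing an ONNX model, enabling the INT8 builder flag, and supplying an entropy calibrator. A minimal sketch in Python, assuming the TensorRT 8.x bindings plus pycuda; the model path, input shape, and random calibration data are illustrative, not part of any repo above.

```python
# Sketch: build an INT8 TensorRT engine from an ONNX model.
# Assumes TensorRT 8.x Python bindings + pycuda; paths and shapes are illustrative.
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds calibration batches to TensorRT during INT8 calibration."""

    def __init__(self, batches, cache_file="calib.cache"):
        super().__init__()
        self.batch_iter = iter(batches)      # list of np.float32 arrays
        self.cache_file = cache_file
        self.batch_size = batches[0].shape[0]
        self.device_input = cuda.mem_alloc(batches[0].nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batch_iter)
        except StopIteration:
            return None                      # no more data: calibration is done
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
calib_data = [np.random.rand(8, 3, 224, 224).astype(np.float32) for _ in range(10)]
config.int8_calibrator = EntropyCalibrator(calib_data)

engine_bytes = builder.build_serialized_network(network, config)
with open("model_int8.engine", "wb") as f:
    f.write(engine_bytes)
```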
👀 Apply YOLOv8, exported to ONNX or TensorRT (FP16, INT8), to a real-time camera feed
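A bare-bones version of such a camera loop, assuming onnxruntime and OpenCV; the model file `yolov8n.onnx`, the 640x640 input size, and the omission of box decoding and NMS are all simplifications.

```python
# Sketch: run an exported YOLOv8 ONNX model on live webcam frames.
# Assumes onnxruntime + OpenCV; model name and input size are illustrative,
# and box decoding / NMS postprocessing is omitted for brevity.
import cv2
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("yolov8n.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

cap = cv2.VideoCapture(0)  # default camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Resize + BGR->RGB + NCHW float32 in [0, 1]
    img = cv2.cvtColor(cv2.resize(frame, (640, 640)), cv2.COLOR_BGR2RGB)
    blob = np.ascontiguousarray(img.astype(np.float32).transpose(2, 0, 1)[None] / 255.0)
    preds = session.run(None, {input_name: blob})[0]
    # Real code would decode boxes and apply NMS here; just report the raw output shape.
    cv2.putText(frame, f"output: {preds.shape}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.imshow("yolov8", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```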
INT8 calibrator for an ONNX model with dynamic batch_size at the input and an NMS module at the output. C++ implementation.
A LLaMA2-7B chatbot with memory running on CPU, optimized using smooth quantization, 4-bit quantization, or Intel® Extension for PyTorch with bfloat16.
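The "smooth quantization" above is in the spirit of SmoothQuant: a per-channel scale migrates activation outliers into the weights so both tensors quantize well to INT8. A minimal PyTorch sketch for one linear layer, with the alpha value and tensor shapes as illustrative assumptions:

```python
# Sketch of a SmoothQuant-style smoothing step for one linear layer.
# Assumes PyTorch; alpha and shapes are illustrative. The per-channel factor
#   s_j = max|X[:, j]|^alpha / max|W[:, j]|^(1 - alpha)
# shifts activation outliers into the weights while keeping X @ W.T unchanged.
import torch

def smooth_linear(x_calib: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """x_calib: (tokens, in_features) activations; weight: (out, in_features)."""
    act_max = x_calib.abs().amax(dim=0)   # per-input-channel activation range
    w_max = weight.abs().amax(dim=0)      # per-input-channel weight range
    s = (act_max.clamp(min=1e-5) ** alpha) / (w_max.clamp(min=1e-5) ** (1 - alpha))
    return x_calib / s, weight * s        # X' = X/s, W' = W*s, so X'W'^T == XW^T

x = torch.randn(64, 256)                  # toy calibration activations
w = torch.randn(128, 256)                 # toy linear weight
x_s, w_s = smooth_linear(x, w)
assert torch.allclose(x @ w.T, x_s @ w_s.T, rtol=1e-3, atol=1e-3)
```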
LLM-Lora-PEFT_accumulate explores optimizations for Large Language Models (LLMs) using PEFT, LoRA, and QLoRA. Contributions of experiments and implementations that improve LLM efficiency are welcome, as are discussions.
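Attaching LoRA adapters with Hugging Face peft typically takes only a few lines. A minimal sketch, where the base model name, rank, and target modules are illustrative assumptions; QLoRA would additionally load the base model in 4-bit via bitsandbytes before this step.

```python
# Sketch: wrap a causal LM with LoRA adapters via Hugging Face peft.
# Assumes transformers + peft; model name, rank r, and target_modules are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=16,                                 # low-rank update dimension
    lora_alpha=32,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # only the adapter weights are trainable
```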
Quantization examples for PTQ (post-training quantization) and QAT (quantization-aware training)
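A minimal sketch of both flows using PyTorch's eager-mode quantization API: PTQ calibrates a trained model on representative data and then converts it, while QAT trains with fake-quant ops before converting. The toy model, fbgemm backend, and random data are illustrative assumptions.

```python
# Sketch: PTQ vs. QAT in PyTorch eager mode. The toy model, 'fbgemm' backend,
# and random data are illustrative assumptions.
import torch
import torch.nn as nn
from torch.ao.quantization import (QuantStub, DeQuantStub, get_default_qconfig,
                                   get_default_qat_qconfig, prepare, prepare_qat,
                                   convert)

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()          # marks the float -> int8 boundary
        self.fc = nn.Linear(16, 4)
        self.dequant = DeQuantStub()      # marks the int8 -> float boundary

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

# --- PTQ: calibrate with representative data, then convert to INT8 ---
ptq_model = TinyNet().eval()
ptq_model.qconfig = get_default_qconfig("fbgemm")
prepare(ptq_model, inplace=True)
for _ in range(8):                        # calibration pass records ranges
    ptq_model(torch.randn(32, 16))
convert(ptq_model, inplace=True)

# --- QAT: train with fake-quant ops in the graph, then convert to INT8 ---
qat_model = TinyNet().train()
qat_model.qconfig = get_default_qat_qconfig("fbgemm")
prepare_qat(qat_model, inplace=True)
opt = torch.optim.SGD(qat_model.parameters(), lr=1e-3)
for _ in range(8):                        # stand-in for a real training loop
    loss = qat_model(torch.randn(32, 16)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
convert(qat_model.eval(), inplace=True)
```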
VB.NET API wrapper for LLM inference via chatllm.cpp