Repository navigation
fp8
- Website
- Wikipedia
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference.
An innovative library for efficient LLM inference via low-bit quantization
Flux diffusion model implementation using quantized fp8 matmul & remaining layers use faster half precision accumulate, which is ~2x faster on consumer devices.
JAX Scalify: end-to-end scaled arithmetics
A modular, accelerator-ready machine learning framework built in Go that speaks float8/16/32/64. Designed with clean architecture, strong typing, and native concurrency for scalable, production-ready AI systems. Ideal for engineers who value simplicity, speed, and maintainability.
Python implementations for multi-precision quantization in computer vision and sensor fusion workloads, targeting the XR-NPE Mixed-Precision SIMD Neural Processing Engine. The code includes visual inertial odometry (VIO), object classification, and eye gaze extraction code in FP4, FP8, Posit4, Posit8, and BF16 formats.