gemm
- Fast inference engine for Transformer models
- 📚 Modern CUDA Learn Notes: 200+ Tensor/CUDA Core kernels 🎉, HGEMM and FA2 via MMA and CuTe, reaching 98~100% of cuBLAS/FA2 TFLOPS
- Tuned OpenCL BLAS
- Multi-Threaded FP32 Matrix Multiplication on x86 CPUs
- Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance
- The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, a JIT assembler, CPU detection, and state-of-the-art vectorized BLAS for floats and integers
- 🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX, and High-Performance Computing (HPC) projects
- Stretching GPU performance for GEMMs and tensor contractions
- DBCSR: Distributed Block Compressed Sparse Row matrix library
- hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond a traditional BLAS library
- A Flexible and Energy-Efficient Accelerator for Sparse Convolutional Neural Networks
- Serial and parallel implementations of matrix multiplication (a baseline GEMM sketch follows below)
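
The GPU projects above all optimize the same core operation, GEMM: C = alpha * A * B + beta * C. As a point of reference for what those kernels improve upon, here is a minimal, deliberately unoptimized CUDA sketch; the kernel name, matrix sizes, and launch configuration are illustrative assumptions, not taken from any repository listed above.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Naive SGEMM: C = alpha * A * B + beta * C, row-major storage,
// one thread per output element. A baseline sketch only; tuned
// kernels tile into shared memory and registers instead.
__global__ void sgemm_naive(int M, int N, int K,
                            float alpha, const float* A,
                            const float* B, float beta, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}

int main() {
    const int M = 512, N = 512, K = 512;  // illustrative sizes
    std::vector<float> hA(M * K, 1.0f), hB(K * N, 1.0f), hC(M * N, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, M * K * sizeof(float));
    cudaMalloc(&dB, K * N * sizeof(float));
    cudaMalloc(&dC, M * N * sizeof(float));
    cudaMemcpy(dA, hA.data(), M * K * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), K * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), M * N * sizeof(float), cudaMemcpyHostToDevice);

    // One 16x16 thread block per 16x16 output tile.
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    sgemm_naive<<<grid, block>>>(M, N, K, 1.0f, dA, dB, 0.0f, dC);
    cudaDeviceSynchronize();

    cudaMemcpy(hC.data(), dC, M * N * sizeof(float), cudaMemcpyDeviceToHost);
    // With A and B all ones, every entry of C sums K ones.
    printf("C[0] = %f (expected %d)\n", hC[0], K);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Compiled with nvcc, this kernel is correct but memory-bound; the listed projects close the gap to cuBLAS mainly through shared-memory tiling, register blocking, and Tensor Core MMA instructions.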