gemm
Fast inference engine for Transformer models
Tuned OpenCL BLAS
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.
Multi-Threaded FP32 Matrix Multiplication on x86 CPUs
🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers
[DEPRECATED] Moved to ROCm/rocm-libraries repo
DBCSR: Distributed Block Compressed Sparse Row matrix library
[DEPRECATED] Moved to ROCm/rocm-libraries repo
A Flexible and Energy-Efficient Accelerator for Sparse Convolutional Neural Networks
Serial and parallel implementations of matrix multiplication
A simple but fast implementation of matrix multiplication in CUDA.