#gemm

xlite-dev/CUDA-Learn-Notes

📚Modern CUDA Learn Notes: 200+ Tensor/CUDA Cores Kernels🎉, HGEMM, FA2 via MMA and CuTe, 98~100% TFLOPS of cuBLAS/FA2.

Cuda
3487
4 days ago

BLISlab: A Sandbox for Optimizing GEMM

C
514
4 years ago

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

Cuda
390
7 months ago

Multi-Threaded FP32 Matrix Multiplication on x86 CPUs

C
347
20 hours ago

Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.

Cuda
337
4 months ago

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers

Nim
285
1 year ago

🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.

248
1 day ago

DBCSR: Distributed Block Compressed Sparse Row matrix library

Fortran
142
5 days ago

Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.

C
142
3 years ago

hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library

Assembly
88
2 days ago

A Flexible and Energy Efficient Accelerator For Sparse Convolution Neural Network

Verilog
63
2 months ago

PyTorch half precision gemm lib w/ fused optional bias + optional relu/gelu

Cuda
63
5 months ago

Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.

Cuda
61
7 months ago

FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme

Cuda
59
1 month ago

Serial and parallel implementations of matrix multiplication

C++
40
4 years ago