gemm

Fast inference engine for Transformer models

neural-machine-translation C++ mkl quantization CUDA thrust opennmt deep-neural-networks openmp onednn intrinsics avx2 avx parallel-computing gemm neon transformer-models machine-translation deep-learning inference

C++

3751

350

11 days ago

xlite-dev / CUDA-Learn-Notes

📚Modern CUDA Learn Notes: 200+ Tensor/CUDA Cores Kernels🎉, HGEMM, FA2 via MMA and CuTe, 98~100% TFLOPS of cuBLAS/FA2.

CUDA gemm cuda-kernels cuda-programming cudnn cutlass flash-attention

Cuda

3487

377

4 days ago

flame / how-to-optimize-gemm

gemm matrix-multiplication blis

1862

357

2 years ago

CNugteren / CLBlast

Tuned OpenCL BLAS

blas opencl blas-libraries matrix-multiplication gemm gpu

C++

1096

204

21 hours ago

flame / blislab

BLISlab: A Sandbox for Optimizing GEMM

gemm matrix-multiplication blis

514

107

4 years ago

Bruce-Lee-LY / cuda_hgemm

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

CUDA gemm cublas Nvidia gpu

Cuda

390

7 months ago

salykova / matmul.c

Multi-Threaded FP32 Matrix Multiplication on x86 CPUs

C gemm matrix-multiplication openmp cpu

347

20 hours ago

yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs

Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.

CUDA gemm Nvidia optimization

Cuda

337

4 months ago

mratsim / laser

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers

high-performance-computing deep-learning blas gemm convolution jit Assembly simd openmp tensor parallel matrix-multiplication

Nim

285

1 year ago

coderonion / awesome-cuda-and-hpc

🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.

CUDA cublas tensorrt Awesome Lists large-language-models gpu blas PyTorch hpc gemm llama cudnn triton tensorrt-llm cutlass mlir tvm deepseek ptx vlm

248

1 day ago

ROCm / Tensile

Stretching GPU performance for GEMMs and tensor contractions.

gemm blas dnn neural-networks machine-learning tensors Python opencl hip auto-tuning amd gpu-computing gpu-acceleration gpu matrix-multiplication Assembly

Python

235

158

3 days ago

cp2k / dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library

blas matrix-multiplication gemm CUDA sparse-matrix mpi hpc linear-algebra

Fortran

142

5 days ago

yzhaiustc / Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F

Stepwise optimizations of DGEMM on CPU, reaching performance faster than Intel MKL eventually, even under multithreading.

blas gemm avx512 simd mkl openmp

142

3 years ago

yui0 / slibs

Single file libraries for C/C++

C single-header-lib audio flac mp3 gpgpu mpeg mp4 m4a aac glsl opencl gemm blas ascii codec encoder math alsa kms

121

8 months ago

ROCm / hipBLASLt

hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library

amd Assembly blas gemm gpu-computing hip machine-learning matrix-multiplication rocm

Assembly

116

2 days ago