Repository navigation

nccl

Website
Wikipedia

cupy / cupy

NumPy & SciPy for GPU

CUDA cudnn cublas cusolver nccl Python NumPy cupy curand cusparse gpu SciPy tensor rocm

Python

10527

954

1 天前

coreylowman / cudarc

Safe rust wrapper around CUDA toolkit

CUDA cuda-programming gpu gpu-acceleration Rust cublas curand cuda-kernels cudnn nccl

Rust

932

117

1 个月前

huggingface / llm_training_handbook

An open collection of methodologies to help with successful training of large language models.

CUDA large-language-models 大语言模型 nccl 自然语言处理 performance Python PyTorch scalability troubleshooting

Python

512

2 年前

LambdaLabsML / distributed-training-guide

Best practices & guides on how to write distributed pytorch training code

CUDA deepspeed distributed-training gpu gpu-cluster kuberentes nccl PyTorch slurm cluster mpi sharding

Python

488

7 个月前

huggingface / large_language_model_training_playbook

An open collection of implementation tips, tricks and resources for training large language models

CUDA 大语言模型 nccl 自然语言处理 performance Python PyTorch scalability troubleshooting large-language-models

Python

481

3 年前

FZJ-JSC / tutorial-multi-gpu

Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial

hpc mpi gpu nccl CUDA

Cuda

303

1 个月前

Bluefog-Lib / bluefog

Distributed and decentralized training framework for PyTorch over graph

mpi distributed-computing 深度学习机器学习 PyTorch decentralized asynchronous one-sided nccl

Python

251

1 年前

microsoft / msrflute

Federated Learning Utilities and Tools for Experimentation

federated-learning gloo 机器学习 nccl personalization privacy-tools PyTorch Simulation

Python

191

2 年前

google / nccl-fastsocket

NCCL Fast Socket is a transport layer plugin to improve NCCL collective communication performance on Google Cloud.

nccl training 机器学习

C++

120

2 年前

openhackathons-org / nways_multi_gpu

N-Ways to Multi-GPU Programming

CUDA hpc mpi nccl

2 个月前

muriloboratto / NCCL

Sample examples of how to call collective operation functions on multi-GPU environments. A simple example of using broadcast, reduce, allGather, reduceScatter and sendRecv operations.

nccl CUDA mpi

2 年前

JuliaGPU / NCCL.jl

A Julia wrapper for the NVIDIA Collective Communications Library.

Julia 语言 CUDA gpu nccl

Julia

9 天前

lanl / pyDNMFk

Python Distributed Non Negative Matrix Factorization with custom clustering

distributed-computing hpc cupy 机器学习 nccl outofmemory Python

Python

2 年前

Zerohertz / Instruct_KR_2025_Summer_Meetup_vLLM

🎹 Instruct.KR 2025 Summer Meetup: 오픈소스 LLM, vLLM으로 Production까지 🎹

CUDA Kubernetes 大语言模型 mlops nccl rdma serving vllm ray

Shell

2 个月前

1duo / nccl-examples

NCCL Examples from Official NVIDIA NCCL Developer Guide.

Nvidia nccl 深度学习 distributed-systems

CMake

7 年前

BaguaSys / bagua-net

High performance NCCL plugin for Bagua.

nccl distributed-computing

Rust

4 年前

Zerohertz / PyCon_KR_2025_Tutorial_vLLM

🐍 PyCon Korea 2025 Tutorial: vLLM의 OpenAI-Compatible Server 톺아보기 🐍

CUDA Kubernetes 大语言模型 mlops nccl rdma serving vllm ray

Shell

2 个月前

YinLiu-91 / ncclOperationPlus

use ncclSend ncclRecv realize ncclSendrecv ncclGather ncclScatter ncclAlltoall

nccl CUDA mpi C++

Cuda

4 年前

YconquestY / nccl

Summary of call graphs and data structures of NVIDIA Collective Communication Library (NCCL)

computer-network nccl

1 年前

lancelee82 / pynccl

Nvidia NCCL2 Python bindings using ctypes and numba.

nccl numba Python

Python

4 年前