RuntimeError: NCCL Error 3: internal error - please report this issue to the NCCL developers

khaoula.rkiba · April 5, 2024, 4:12pm

Hi, I’m fine-tuning Llama 2 with LoRA for question answering on a Standard NC80adis H100 v5 VM on Azure (80 vCPUs, 640 GiB memory, 2 NVIDIA H100 GPUs), but training fails inside fine_tuning.train() with RuntimeError: NCCL Error 3: internal error - please report this issue to the NCCL developers. The VM reports NVIDIA-SMI 550.54.15, Driver Version 550.54.15, CUDA Version 12.4. While debugging I can see NCCL version 2.19.3+cuda12.3. I tried building NVIDIA/nccl from source, and also tried the build from the official NVIDIA website, but neither fixed it. Any hints on how this can be fixed? Thanks in advance.
NB: The same script runs successfully on a Standard NC40ads H100 v5 instance (40 vCPUs, 320 GiB memory) with 1 NVIDIA H100.
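
To make the multi-GPU failure easier to diagnose, the library versions and NCCL's own initialization log can be captured before training starts. Below is a minimal sketch, assuming the training script uses PyTorch (this is not stated above, although the error format suggests it). `nccl_debug_env` and `report_versions` are hypothetical helper names introduced for illustration. `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS` and `NCCL_P2P_DISABLE` are standard NCCL environment variables.

```python
import importlib.util
import os


def nccl_debug_env(disable_p2p=False):
    """Return environment variables that make NCCL log its initialization."""
    env = {
        "NCCL_DEBUG": "INFO",             # log NCCL version, topology, and errors
        "NCCL_DEBUG_SUBSYS": "INIT,NET",  # restrict logging to init and network
    }
    if disable_p2p:
        # Diagnostic only: turns off direct GPU-to-GPU peer access.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


def report_versions():
    """Collect version info if PyTorch is installed; otherwise return {}."""
    if importlib.util.find_spec("torch") is None:
        return {}
    import torch  # assumption: PyTorch is the framework in use

    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,
        "gpus": torch.cuda.device_count(),
    }
    if torch.cuda.is_available() and torch.distributed.is_nccl_available():
        info["nccl"] = torch.cuda.nccl.version()
    return info


# Set the variables before the first NCCL communicator is created.
os.environ.update(nccl_debug_env())
print(report_versions())
```

Running the 2-GPU job once with `nccl_debug_env(disable_p2p=True)` is only a way to check whether peer-to-peer transport is involved, not a fix; the `NCCL INFO` lines printed just before the error are usually the most useful part to post.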
