ncclGroupEnd "unhandled cuda error"

robosmith · March 7, 2018, 9:34pm

Trying to use NCCL 2.05 for multi-GPU reduction of complex values on AWS p3 Volta instances with Ubuntu 16.04 & SDK 9.1

My code:

ncclComm_t comms[MAX_GPU];
NCCLCHECK(ncclCommInitAll(comms, nDevices, nullptr));

NCCLCHECK(ncclGroupStart());
for (int j = 1; j < nlim; j++) 
{
	NCCLCHECK(ncclReduce((const void*)params[j]->dev_wrxsa[0], (void*)params[0]->dev_wrxsa[0], 
			npulse_size * 2, ncclFloat, ncclSum, 0, comms[j], streams[j]));
}
NCCLCHECK(ncclGroupEnd());

I am getting “unhandled cuda error” on the ncclGroupEnd function call.
If I delete that line, the code will sometimes complete w/o error, but mostly core dumps.

The send and receive buffers are allocated with cudaMallocManaged.

I’m expecting this to sum all other GPU’s buffers into the GPU 0 buffer.

How can I figure out what’s causing the “unhandled cuda error”?

Robert_Crovella · March 7, 2018, 9:53pm

cuda-memcheck ?

robosmith · March 8, 2018, 12:42am

Thanks for the suggestion txbob.

Ran memcheck and all I get (besides a bunch of peer access already enabled errors) is:

========= Program hit cudaErrorNotSupported (error 71) due to "operation not supported" on CUDA API call to cudaEventRecord.
Failed, NCCL error RadarSim/Radar.cu:733 'unhandled cuda error'
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/jet/lib/libcuda.so.1 [0x311153]
=========     Host Frame:./clutterProc.exe [0x8ba5a]
=========     Host Frame:./clutterProc.exe [0x1ab5d]
=========     Host Frame:./clutterProc.exe [0x1611a]
=========     Host Frame:./clutterProc.exe [0xa724]
=========     Host Frame:./clutterProc.exe [0x80e7]
=========     Host Frame:/jet/lib/libc.so.6 (__libc_start_main + 0xf0) [0x20660]
=========     Host Frame:./clutterProc.exe [0x6df9]

Cannot make anything out of ^this.

I even disable all my peer access requests, but I still get superfluous errors about it.

Robert_Crovella · March 8, 2018, 4:06am

So now what you’re really asking is “please help me debug my code”.

For that, I would not even think about getting started without a complete test case.

In this particular case, I’m reasonably sure there is a NCCL allreduce sample code, so I would certainly suggest you study that.

robosmith · March 8, 2018, 5:26am

No, my code is working perfectly, until I add NCCL calls to make NVLink speeds work in Linux.

The code I added which is detailed here, is almost identical to the allreduce example, except I am using ncclReduce and complex data types as float x 2. Should work since I’m just summing real and imag independently.

In fact the previous version works very fast with NVLink speeds in Windows on a 4 GPU instance with a custom kernel for summing complex float buffers.

I just cannot identify anything which could cause ncclGroupEnd to create an “unhandled cuda error” and I don’t even know what that is or what COULD cause it.

robosmith · March 10, 2018, 12:51am

I compiled and ran the allReduce example code on an AWS 8 GPU p3 instance running Ubuntu 16.
Linked with libnccl_static.a from NCCL v2.0.4
It runs with no errors.

I then changed it to call ncclReduce with root node 0 and it also runs with no errors.

robosmith · March 10, 2018, 6:31am

allreduce.cpp listed below

#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"
#include "nccl.h"
#define EXIT_FAILURE 1
#define MAX_GPU 8

#define CUDACHECK(cmd) do {                         \
  cudaError_t e = cmd;                              \
  if( e != cudaSuccess ) {                          \
    printf("Failed: Cuda error %s:%d '%s'\n",             \
        __FILE__,__LINE__,cudaGetErrorString(e));   \
    exit(EXIT_FAILURE);                             \
  }                                                 \
} while(0)

#define NCCLCHECK(cmd) do {                         \
  ncclResult_t r = cmd;                             \
  if (r!= ncclSuccess) {                            \
    printf("Failed, NCCL error %s:%d '%s'\n",             \
        __FILE__,__LINE__,ncclGetErrorString(r));   \
    exit(EXIT_FAILURE);                             \
  }                                                 \
} while(0)

int main(int argc, char* argv[])
{
  ncclComm_t comms[MAX_GPU];
  //managing  devices
  int nDev = 4;
  CUDACHECK(cudaGetDeviceCount(&nDev));
  printf("%d gpus found\n", nDev);
  int size = 32*1024*1024;
  int devs[4] = { 0, 1, 2, 3 };

  //allocating and initializing device buffers
  float** sendbuff = (float**)malloc(nDev * sizeof(float*));
  float** recvbuff = (float**)malloc(nDev * sizeof(float*));
  cudaStream_t* s = (cudaStream_t*)malloc(sizeof(cudaStream_t)*nDev);

  for (int i = 0; i < nDev; ++i) {
    printf("init gpu %d\n", i);
    CUDACHECK(cudaSetDevice(i));
    CUDACHECK(cudaMalloc(sendbuff + i, size * sizeof(float)));
    CUDACHECK(cudaMalloc(recvbuff + i, size * sizeof(float)));
    CUDACHECK(cudaMemset(sendbuff[i], 1, size * sizeof(float)));
    CUDACHECK(cudaMemset(recvbuff[i], 0, size * sizeof(float)));
    CUDACHECK(cudaStreamCreate(s+i));
  }

  //initializing NCCL
  NCCLCHECK(ncclCommInitAll(comms, nDev, 0));

  //calling NCCL communication API. Group API is required when using
  //multiple devices per thread
  NCCLCHECK(ncclGroupStart());
  for (int i = 0; i < nDev; ++i)
    NCCLCHECK(ncclAllReduce((const void*)sendbuff[i], (void*)recvbuff[i], size, ncclFloat, ncclSum,
        comms[i], s[i]));
  NCCLCHECK(ncclGroupEnd());

  //synchronizing on CUDA streams to wait for completion of NCCL operation
  for (int i = 0; i < nDev; ++i) {
    CUDACHECK(cudaSetDevice(i));
    CUDACHECK(cudaStreamSynchronize(s[i]));
  }

  //free device buffers
  for (int i = 0; i < nDev; ++i) {
    CUDACHECK(cudaSetDevice(i));
    CUDACHECK(cudaFree(sendbuff[i]));
    CUDACHECK(cudaFree(recvbuff[i]));
  }

  //finalizing NCCL
  for(int i = 0; i < nDev; ++i)
      ncclCommDestroy(comms[i]);

  printf("Success \n");
  return 0;
}

hamid.ramezani1375 · October 23, 2020, 3:00pm

Hi @robosmith :)

I struggle with the exact same error for a week. I was wondering if you can provide more details that how you solved it?

Best,
Hamid

robosmith · October 23, 2020, 3:30pm

I no longer use NCCL. I use cudaMemCpyPeer and a custom CUDA kernel to sum the 2 memory blocks.

Topic		Replies	Views
Why is the performance of cudaLaunchHostFunc low, and what are the optimization strategies or alternative solutions? CUDA Programming and Performance cuda	8	61	September 14, 2024
NCCL Error: “invalid device function” - Is it due to NCCL version incompatibility with CUDA 11.3? CUDA Setup and Installation cuda	0	70	January 20, 2025
Linux kernel 5.10+ CUDA_ERROR_MISALIGNED_ADDRESS CUDA Setup and Installation	0	826	September 29, 2021
NCCL code example questions CUDA Programming and Performance	1	423	January 11, 2024
Using unified memory causes system crash CUDA Programming and Performance	28	5873	February 4, 2019
How to debug "Invalid memory reference" while generating linker CUDA Programming and Performance	17	5178	August 21, 2014
nvcc and googletest CUDA Programming and Performance	5	16370	July 7, 2011
Program gives unexpected error compiles smooth, but output is unexpected result CUDA Programming and Performance	5	3295	October 17, 2011
Cuda application crashes works fine for small data and crashes for big data CUDA Programming and Performance	3	414	October 12, 2021
Cuda runtime failing to launch kernel from CSharp Wrapper function. CUDA Programming and Performance	6	1373	April 13, 2016

ncclGroupEnd "unhandled cuda error"

Related topics