Device hangs / freezes / crashes under specific circumstances

My otherwise correctly functioning CUDA code seems to hang the device with certain block/thread-count configurations. After that point I can't launch new kernels and have to reboot my machine.

How can I start troubleshooting an issue like that?

Circumstances where it happens:
Launching a kernel with ~400 blocks and a thread quantity > 1

Things I’ve tried:
cuda-memcheck (0 Errors)
Testing various other configurations of thread and block size. The performance is best with many blocks and only 1 thread per block, then gets worse and eventually stalls completely.

Facts about my setup:

  • Language: C
  • GeForce 3060 12GB
  • Platforms where this bug happens: both WSL Ubuntu and a Pop!_OS 21.10 dual-boot installation on the same machine
  • CUDA version: 11.2
  • I'm using dynamic parallelism / sub-kernels, so I could have tens of thousands of threads once the sub-kernels are launched.

On an Ampere GPU I recommend using compute-sanitizer.

I would recommend proper, comprehensive CUDA error checking (every API call, every kernel launch, in both host and device code).
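
On the host side, a minimal sketch of what I mean (the macro name and usage are illustrative; adapt it to your own code):

#include <stdio.h>
#include <stdlib.h>

// check the status returned by every CUDA runtime API call
#define checkCuda(call) do {                                          \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
        fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                cudaGetErrorString(err), __FILE__, __LINE__);         \
        exit(1);                                                      \
    }                                                                 \
} while (0)

// usage after a kernel launch: check both the launch and its execution
//   myKernel<<<blocks, threads>>>(...);
//   checkCuda(cudaGetLastError());        // launch/configuration errors
//   checkCuda(cudaDeviceSynchronize());   // errors while the kernel runs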

A hang in device code often requires the same sort of construct as it would in host code: a while loop or similar construct waiting on something. One thread per block is not the way to get good performance from a GPU, and having code that depends on that for proper behavior is evidence of a design flaw of some sort, IMO. I'm not going to argue it in the abstract; others may have a different opinion. I'm quite certain that if you want attractive performance from a CUDA GPU, one thread per block is a terribly bad design choice.
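
As an abstract illustration (not your code), this is the kind of construct I mean; a spin-wait that depends on some other thread making progress can hang forever if the launch configuration changes which threads actually run concurrently:

// hypothetical example: spins on a flag another thread is expected to set;
// if the producer never runs (or never gets scheduled), this loop never
// exits and the kernel appears to hang the device
__global__ void consumer(volatile int *flag, int *out)
{
    while (*flag == 0) { }   // waits forever unless something sets *flag
    *out = 1;
}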

WSL support for CUDA is still pretty new. There may be some rough edges. I’ve never heard of Pop! OS. It isn’t one of the supported distributions for CUDA development, and if I were struggling with an issue like this, I would want to remove that as a possible contributor.

Hi Robert - I looked more into this and followed your advice:

I installed Ubuntu on bare metal, i.e. no WSL.

I set up a new experiment with very simple code performing a simple workload: GitHub - use/cuda-performance-test, with a large number of threads and sub-threads. I used lots of error checking.

I ran this experiment on two environments: my local Ubuntu machine with the 3060, and an AWS P2 instance with a Tesla K80.

I found results similar to those from the neural-net code I mentioned in my OP: when using dynamic parallelism, with the same total number of threads performing the same workloads but in varying grid configurations, the best configurations look like this:

  • Main grid: <<<N, 1>>>
  • Sub grid: <<<1, N>>>

Other grid configurations can be up to 100x slower, or even seem to hang the device (or take so long that I can't observe them finishing).

Can you help me understand why that would be the case? Is this a commonly known thing? I couldn't find it documented anywhere, but it seems that when using a large number of threads + subgrids, N-1-1-N is the best configuration. Maybe it's due to subtleties of thread scheduling?
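
To make the pattern concrete, here is a stripped-down sketch of the launch shape I mean (kernel names and workload are illustrative, not my actual code):

__global__ void childKernel(float *data)
{
    // "sub grid" <<<1, N>>>: one block of N threads, each doing independent work
    data[threadIdx.x] += 1.0f;
}

__global__ void parentKernel(float *data, int n)
{
    // "main grid" <<<N, 1>>>: N blocks with 1 thread each;
    // each parent thread launches one child grid
    childKernel<<<1, n>>>(data + blockIdx.x * n);
}

// host side: parentKernel<<<N, 1>>>(d_data, n);
// built with relocatable device code, e.g. nvcc -rdc=true -lcudadevrt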

Here are my experiment results for reference: Local Ubuntu 3060, AWS P2 Tesla K80

(It seems you’re not asking about device hangs/freezes/crashes anymore.)

I don’t really have any insight. I haven’t benchmarked CDP that carefully/closely.

I have not run into anyone reporting it nor seen reports like that.

I have never seen that kind of info documented anywhere.

Hey, you probably already figured it out since you asked this more than 2 years ago, but

Try using CUDA streams that are non-blocking if you are going to spawn a bunch of blocks and then have those blocks spawn threads. That is, use the streams on the block launches; the threads will be fine.
cudaStream_t curStream;
cudaStreamCreateWithFlags(&curStream, cudaStreamNonBlocking);

If you spawn all your blocks on the same blocking stream, or don't specify a stream at all, the launches use the default stream for that kernel, and the default stream is blocking. If you launch two kernels on the same blocking stream, they run one at a time: the first kernel runs, then the second waits for the first one to finish. At least that's what I gathered from the docs and from messing with it.
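
Roughly, inside the parent kernel it looks like this (a sketch with made-up kernel names; adapt to your code):

__global__ void childKernel(int *out, int n);

__global__ void parentKernel(int *out, int n)
{
    // per-thread non-blocking stream, so child launches from different
    // parent threads don't serialize on the device-side default stream
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    // launch the child grid on that stream
    childKernel<<<1, n, 0, s>>>(out + threadIdx.x * n, n);

    cudaStreamDestroy(s);
}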

I had the exact same problem. I still have the issue where everything just hangs if you spawn too many blocks, though.

I had a nested kernel call inside a parent kernel launch of 5000 blocks / 100 threads. The nested kernels would die silently. You can see the error by using a macro that gets the last error. Just put this right after a kernel call:
cdpErrchk(cudaPeekAtLastError());

The macro definition:
#include <assert.h>
#include <stdio.h>

#define cdpErrchk(ans) { cdpAssert((ans), __FILE__, __LINE__); }

__device__ void cdpAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        printf("GPU kernel assert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) assert(0);
    }
}

The error for me was "GPU kernel assert: launch failed because launch would exceed cudaLimitDevRuntimePendingLaunchCount"

I'm guessing the buffered kernel launches in the nest are probably stored on the card rather than on the host, so it just doesn't have enough memory to keep buffering.

Edit: I figured out the max is around 2k nested kernel calls on top of 2k external calls. That's 4k calls total. I have a 3080, which has over 8k CUDA cores, so none of those calls should need to be buffered. The code was test code that didn't use any user-defined memory. If I had to guess, it's a hard-coded amount.

Edit 2: Yes, it was hard-coded. You can change the hard-coded amount using this code (change 3000 to the number of nested blocks * threads you need):
cudaError_t cudaStatus = cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 3000);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaLimitDevRuntimePendingLaunchCount failed! Do you have a CUDA-capable GPU installed?");
    exit(1);
}