Device hangs / freezes / crashes under specific circumstances

My otherwise correctly functioning CUDA code seems to hang the device in certain block/thread configurations. After that point, I can’t launch new kernels and I have to reboot my machine.

How can I start troubleshooting an issue like that?

Circumstances where it happens:

  • Launching a kernel with ~400 blocks and more than 1 thread per block

Things I’ve tried:

  • cuda-memcheck (0 errors)
  • Testing various other thread/block configurations. Performance is best with many blocks and only 1 thread per block, then gets worse and eventually stalls completely.

Facts about my setup:

  • Language: C
  • GeForce RTX 3060 12GB
  • Platforms where this bug happens: Both WSL Ubuntu, and Pop!_OS 21.10 dual-boot installation on the same machine
  • CUDA version: 11.2
  • I’m using dynamic parallelism / sub-kernels, so I could have tens of thousands of threads once the sub-kernels are launched.

On an Ampere GPU I recommend using compute-sanitizer.
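For reference, a typical invocation looks like the following (the binary name `./app` is a placeholder; the sub-tools shown ship with the CUDA 11.x toolkit):

```shell
# Default tool (memcheck): out-of-bounds and misaligned accesses
compute-sanitizer ./app

# Other sub-tools that can surface hang-related bugs:
compute-sanitizer --tool racecheck ./app   # shared-memory data races
compute-sanitizer --tool synccheck ./app   # invalid __syncthreads() usage
compute-sanitizer --tool initcheck ./app   # reads of uninitialized device memory
```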

I would recommend proper, comprehensive CUDA error checking (every API call, every kernel launch, in both host and device code).
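In practice that usually means a macro wrapped around every runtime API call, plus a `cudaGetLastError()` / `cudaDeviceSynchronize()` pair after every kernel launch. A minimal sketch (the `CUDA_CHECK` name and trivial kernel are my own, not from the code under discussion):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; abort with file/line info on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

__global__ void kernel(int *out) { out[threadIdx.x] = threadIdx.x; }

int main() {
    int *d_out;
    CUDA_CHECK(cudaMalloc(&d_out, 32 * sizeof(int)));
    kernel<<<1, 32>>>(d_out);
    CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during kernel execution
    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

In device code with dynamic parallelism, the same pattern applies to the device-side launches: check `cudaGetLastError()` after each child launch.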

A hang in device code often requires the same sort of construct as it would in host code: a while loop or similar construct waiting on something. One thread per block is not the way to get good performance from a GPU, and code that depends on that configuration for correct behavior is evidence of a design flaw of some sort, IMO. I’m not going to argue it in the abstract; others may have a different opinion. But I’m quite certain that if you want attractive performance from a CUDA GPU, one thread per block is a terribly bad design choice.
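As a contrived illustration of the kind of construct that hangs: a kernel with an inter-block dependency, where one block spin-waits on a flag set by another. CUDA makes no guarantee that all blocks of a grid are resident simultaneously, so if the producer block is never scheduled while the consumer occupies an SM, the loop never exits (kernel and flag names are hypothetical):

```cuda
// Deadlock-prone pattern: block 0 spin-waits on a flag written by the
// LAST block of the grid. With a large grid, the last block may not be
// scheduled until earlier blocks retire -- but block 0 never retires,
// because it is waiting. The kernel hangs.
__global__ void deadlock_prone(volatile int *flag) {
    if (blockIdx.x == gridDim.x - 1) {
        if (threadIdx.x == 0) *flag = 1;   // producer
    } else if (blockIdx.x == 0) {
        while (*flag == 0) { /* spin */ }  // consumer: may spin forever
    }
}
```

This would explain a symptom that appears only past a certain block count: with few blocks everything fits on the SMs at once and the dependency happens to resolve; past the residency limit, it deadlocks.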

WSL support for CUDA is still pretty new, so there may be some rough edges. I’ve never heard of Pop!_OS. It isn’t one of the supported distributions for CUDA development, and if I were struggling with an issue like this, I would want to remove it as a possible contributor.

Hi Robert - I looked more into this and followed your advice:

I installed Ubuntu on bare metal, i.e. no WSL.

I set up a new experiment with very simple code performing a simple workload: GitHub - use/cuda-performance-test, with a large number of threads and sub-threads. I used lots of error checking.

I ran this experiment in both environments: my local Ubuntu 3060 machine and an AWS P2 Tesla K80 instance.

I found results similar to the neural-net code I mentioned in my OP: when using dynamic parallelism, with the same total number of threads performing the same workloads but in varying grid configurations, the best configurations look like this:

  • Main grid: <<<N, 1>>>
  • Sub grid: <<<1, N>>>

Other grid configurations can be up to 100x slower, or even seem to hang the device (or take so long that I can’t observe them finishing).
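To make the configuration being compared concrete, here is a minimal dynamic-parallelism sketch of the N-1-1-N pattern (kernel names are mine, not from the linked repo; CDP code must be compiled with `-rdc=true`):

```cuda
__global__ void childKernel(float *data, int parent) {
    // <<<1, N>>> child: N threads in a single block work on the
    // parent's slice of the data.
    data[parent * blockDim.x + threadIdx.x] *= 2.0f;
}

__global__ void parentKernel(float *data, int n) {
    // <<<N, 1>>> parent: each single-thread block launches one child grid.
    childKernel<<<1, n>>>(data, blockIdx.x);
}

// Host side: parentKernel<<<numParents, 1>>>(d_data, childThreads);
```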

Can you help me understand why that would be the case? Is this commonly known? I couldn’t find it documented anywhere, but it seems that when using a large number of threads plus subgrids, N-1-1-N is the best configuration. Maybe it’s due to subtleties of thread scheduling?

Here are my experiment results for reference: Local Ubuntu 3060, AWS P2 Tesla K80

(It seems you’re not asking about device hangs/freezes/crashes anymore.)

I don’t really have any insight. I haven’t benchmarked CDP that carefully/closely.

I have not run into anyone reporting this behavior, nor seen reports like it.

I have never seen that kind of info documented anywhere.