Device hangs / freezes / crashes under specific circumstances

My otherwise correctly functioning CUDA code seems to hang the device with certain block/thread-count configurations. After that point I can't launch new kernels and have to reboot my machine.

How can I start troubleshooting an issue like that?

Circumstances where it happens:
Launching a kernel with ~400 blocks and a thread quantity > 1

Things I’ve tried:
cuda-memcheck (0 Errors)
Testing various other configurations of thread and block size. The performance is best with many blocks and only 1 thread per block, then gets worse and eventually stalls completely.

Facts about my setup:

  • Language: C
  • GeForce 3060 12GB
  • Platforms where this bug happens: both WSL Ubuntu and a Pop!_OS 21.10 dual-boot installation on the same machine
  • CUDA version: 11.2
  • I'm using dynamic parallelism / sub-kernels, so I could have tens of thousands of threads once the sub-kernels are launched.

On an Ampere GPU I recommend using compute-sanitizer.

I would recommend proper, comprehensive CUDA error checking (every API call, every kernel launch, in both host and device code).
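
On the host side, a minimal sketch of what I mean (the macro name and usage are illustrative; adapt it to your own code):

#include <stdio.h>
#include <stdlib.h>

// check the status returned by every CUDA runtime API call
#define checkCuda(call) do {                                          \
    cudaError_t err = (call);                                         \
    if (err != cudaSuccess) {                                         \
        fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                cudaGetErrorString(err), __FILE__, __LINE__);         \
        exit(1);                                                      \
    }                                                                 \
} while (0)

// usage after a kernel launch: check both the launch and its execution
//   myKernel<<<blocks, threads>>>(...);
//   checkCuda(cudaGetLastError());        // launch/configuration errors
//   checkCuda(cudaDeviceSynchronize());   // errors while the kernel runs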

A hang in device code often requires the same sort of construct as it would in host code: a while loop or similar construct waiting on something. One thread per block is not the way to get good performance from a GPU, and having code that depends on that for proper behavior is evidence of a design flaw of some sort, IMO. I'm not going to argue it in the abstract; others may have a different opinion. I'm quite certain that if you want attractive performance from a CUDA GPU, one thread per block is a terribly bad design choice.
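
As an abstract illustration (not your code), this is the kind of construct I mean; a spin-wait that depends on some other thread making progress can hang forever if the launch configuration changes which threads actually run concurrently:

// hypothetical example: spins on a flag another thread is expected to set;
// if the producer never runs (or never gets scheduled), this loop never
// exits and the kernel appears to hang the device
__global__ void consumer(volatile int *flag, int *out)
{
    while (*flag == 0) { }   // waits forever unless something sets *flag
    *out = 1;
}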

WSL support for CUDA is still pretty new. There may be some rough edges. I’ve never heard of Pop! OS. It isn’t one of the supported distributions for CUDA development, and if I were struggling with an issue like this, I would want to remove that as a possible contributor.

Hi Robert - I looked more into this and followed your advice:

I installed Ubuntu on bare metal, i.e. no WSL.

I set up a new experiment with very simple code performing a simple workload: GitHub - use/cuda-performance-test, with a large number of threads and sub-threads. I used lots of error checking.

I ran this experiment on two environments: my local Ubuntu machine with the 3060, and an AWS P2 instance with a Tesla K80.

I found results similar to those from the neural-net code I mentioned in my OP: when using dynamic parallelism, with the same total number of threads performing the same workloads but in varying grid configurations, the best configurations look like this:

  • Main grid: <<<N, 1>>>
  • Sub grid: <<<1, N>>>

Other grid configurations can be up to 100x slower, or even seem to hang the device (or take so long that I can't observe them finishing).

Can you help me understand why that would be the case? Is this a commonly known thing? I couldn't find it documented anywhere, but it seems that when using a large number of threads + subgrids, N-1-1-N is the best configuration. Maybe it's due to subtleties of thread scheduling?
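
To make the pattern concrete, here is a stripped-down sketch of the launch shape I mean (kernel names and workload are illustrative, not my actual code):

__global__ void childKernel(float *data)
{
    // "sub grid" <<<1, N>>>: one block of N threads, each doing independent work
    data[threadIdx.x] += 1.0f;
}

__global__ void parentKernel(float *data, int n)
{
    // "main grid" <<<N, 1>>>: N blocks with 1 thread each;
    // each parent thread launches one child grid
    childKernel<<<1, n>>>(data + blockIdx.x * n);
}

// host side: parentKernel<<<N, 1>>>(d_data, n);
// built with relocatable device code, e.g. nvcc -rdc=true -lcudadevrt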

Here are my experiment results for reference: Local Ubuntu 3060, AWS P2 Tesla K80

(It seems you’re not asking about device hangs/freezes/crashes anymore.)

I don’t really have any insight. I haven’t benchmarked CDP that carefully/closely.

I have not run into anyone reporting it nor seen reports like that.

I have never seen that kind of info documented anywhere.

Hey, you probably already figured it out since you asked this more than 2 years ago, but

Try using CUDA streams that are non-blocking if you are going to spawn a bunch of blocks and then have those blocks spawn threads. That is, use the streams on the block launches; the threads will be fine.
cudaStream_t curStream;
cudaStreamCreateWithFlags(&curStream, cudaStreamNonBlocking);

If you spawn all your blocks on the same blocking stream, or don't specify a stream at all, the launches use the default stream for that kernel, and the default stream is blocking. If you launch two kernels on the same blocking stream, they run one at a time: the first kernel runs, then the second waits for the first one to finish. At least that's what I gathered from the docs and from messing with it.
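
Roughly, inside the parent kernel it looks like this (a sketch with made-up kernel names; adapt to your code):

__global__ void childKernel(int *out, int n);

__global__ void parentKernel(int *out, int n)
{
    // per-thread non-blocking stream, so child launches from different
    // parent threads don't serialize on the device-side default stream
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    // launch the child grid on that stream
    childKernel<<<1, n, 0, s>>>(out + threadIdx.x * n, n);

    cudaStreamDestroy(s);
}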

I had the exact same problem. I still have the issue where everything just hangs if you spawn too many blocks, though.

I had a nested kernel call inside a parent kernel launch of 5000 blocks / 100 threads. The nested kernels would die silently. You can see the error by using a macro that gets the last error. Just put this right after a kernel call:
cdpErrchk(cudaPeekAtLastError());

The macro definition:
#include <assert.h>
#include <stdio.h>

#define cdpErrchk(ans) { cdpAssert((ans), __FILE__, __LINE__); }

__device__ void cdpAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        printf("GPU kernel assert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) assert(0);
    }
}

The error for me was "GPU kernel assert: launch failed because launch would exceed cudaLimitDevRuntimePendingLaunchCount"

I'm guessing the buffered kernel launches in the nest are probably stored on the card rather than on the host, so it just doesn't have enough memory to keep buffering.

Edit: I figured out the max is around 2k nested kernel calls on top of 2k external calls. That's 4k calls total. I have a 3080, which has over 8k CUDA cores, so none of those calls should need to be buffered. The code was test code that didn't use any user-defined memory. If I had to guess, it's a hard-coded amount.

Edit 2: Yes, it was hard-coded. You can change the hard-coded amount using this code (change 3000 to the number of nested blocks * threads you need):
cudaError_t cudaStatus = cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 3000);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaLimitDevRuntimePendingLaunchCount failed! Do you have a CUDA-capable GPU installed?");
    exit(1);
}