"too many resources requested for launch." - on *second* launch of a kernel

k<<<4, 512, 0, stream>>> (…

First launch succeeds.

Second launch fails with “too many resources requested for launch”.

Same block-count and thread-count; same stream.
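The failure shows up immediately at launch time with the usual error-checking pattern; a minimal sketch of how it is observed (the empty-bodied k is a placeholder for the real kernel and its arguments):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder with an empty body, standing in for the real kernel k
__global__ void k() {}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 1; i <= 2; ++i) {
        k<<<4, 512, 0, stream>>>();
        // Launch-configuration failures (e.g. "too many resources requested
        // for launch") are reported immediately by cudaGetLastError
        cudaError_t err = cudaGetLastError();
        printf("launch %d: %s\n", i, cudaGetErrorString(err));
        // Errors from the kernel's execution only surface at a later sync
        cudaStreamSynchronize(stream);
    }
    cudaStreamDestroy(stream);
    return 0;
}
```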

This doesn’t occur in an optimized build, only in a debug build (-g -G -O0). Reducing the thread count to 256 didn’t fix it.

I thought resource usage depended only on the shared memory size (0 here) and the number of registers required (thread count × registers per thread), which should be the same on both launches.
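For reference, the per-thread register and local-memory footprint can be queried directly with cudaFuncGetAttributes; a minimal sketch (the empty-bodied k is a placeholder for the real kernel):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder with an empty body, standing in for the real kernel k
__global__ void k() {}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, k);
    // numRegs and localSizeBytes are per-thread figures; a -G debug build
    // typically reports noticeably higher values than an optimized build
    printf("registers per thread : %d\n", attr.numRegs);
    printf("local mem per thread : %zu bytes\n", attr.localSizeBytes);
    printf("static shared memory : %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```

Compiling with nvcc -Xptxas -v prints the same register and local-memory figures at build time, which makes it easy to compare the -G and optimized builds.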

What am I missing?

Another resource is memory: local memory, stack, and other per-thread allocations. One of the resource parameters here is the total possible thread count (number of SMs times 2048), which won’t be affected by reducing the per-block thread count. Nevertheless, it looks like a fairly unusual case to me.
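For what it’s worth, the per-thread stack reservation and the other device-side memory pools can be inspected through the device limits API; a small sketch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stackBytes = 0, heapBytes = 0, fifoBytes = 0;
    // Per-thread stack reservation; debug (-G) code tends to need more of it
    cudaDeviceGetLimit(&stackBytes, cudaLimitStackSize);
    cudaDeviceGetLimit(&heapBytes,  cudaLimitMallocHeapSize);
    cudaDeviceGetLimit(&fifoBytes,  cudaLimitPrintfFifoSize);
    printf("stack size per thread : %zu bytes\n", stackBytes);
    printf("device malloc heap    : %zu bytes\n", heapBytes);
    printf("printf FIFO           : %zu bytes\n", fifoBytes);
    return 0;
}
```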

The kernel uses only a handful of local variables (5 pointers, 1 short, and 8 integers).

And it is self-contained (calls no functions).

Re the total possible thread count, does that relate to other kernels running in parallel? This is the only kernel present; nothing else is running on the device.

I wouldn’t be able to say what is happening in your case without a complete test case to examine. Even then, I make no claims or guarantees.

I gave a fairly specific definition of total thread count. I am referring to the number of SMs in the GPU you are running on multiplied by 2048. No other considerations. It is one of the parameters the runtime uses to determine how much memory to reserve for a particular kernel launch. I really have no idea if this is related to what you are seeing or not. I was merely pointing out that shared memory usage and registers are not the only resources required for a kernel launch.
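To put rough numbers on that, the total possible thread count and a back-of-the-envelope local-memory figure can be computed from the device properties and the kernel’s per-thread local memory usage. This is a speculative sketch, not the runtime’s exact accounting, and the empty-bodied k is a placeholder for the real kernel:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder with an empty body, standing in for the real kernel k
__global__ void k() {}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // "Total possible thread count": SMs * max resident threads per SM
    size_t totalThreads = (size_t)prop.multiProcessorCount
                        * prop.maxThreadsPerMultiProcessor;

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, k);

    // Very rough upper bound on the local memory backing this kernel's launch
    size_t localEstimate = totalThreads * attr.localSizeBytes;

    printf("SMs                    : %d\n", prop.multiProcessorCount);
    printf("total possible threads : %zu\n", totalThreads);
    printf("local mem per thread   : %zu bytes\n", attr.localSizeBytes);
    printf("estimated local backing: %.1f MB\n",
           localEstimate / (1024.0 * 1024.0));
    return 0;
}
```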

Yes, understood.

Threads: the GPU is a 1080 Ti with 28 SMs and 11GB GDDR5X. At 28 × 2048 = 57,344 possible threads versus the 4 × 512 = 2,048 being launched, the kernel is well under that limit.

Memory: there’s a static data structure consuming 7GB, and a small per-kernel temp* buffer that’s around 4MB. That would seem to leave plenty of memory.

Without a more specific error code, it’s guesswork. Hopefully at some point the dev team will refine the error codes to pinpoint which resource caused the problem. For now, I’ll stick to optimized builds and hope it doesn’t occur there.

Thanks.

* It’s actually set aside between kernel launches and re-used.

Perhaps, right before the 2nd (failing) kernel launch, you could call cudaMemGetInfo and print out the available free memory. If it is “small”, that might indicate the issue is related to memory. Again, just speculation here. You might even be able to estimate the amount of memory needed for the kernel launch by calling cudaMemGetInfo immediately before and after the 1st kernel launch.
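Something along these lines (a sketch of that suggestion; the empty-bodied k and the launch configuration are placeholders for the real ones):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder with an empty body, standing in for the real kernel k
__global__ void k() {}

static void reportFree(const char *label)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    printf("%-20s free %.1f MB of %.1f MB\n", label,
           freeBytes / (1024.0 * 1024.0), totalBytes / (1024.0 * 1024.0));
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    reportFree("before 1st launch:");
    k<<<4, 512, 0, stream>>>();
    cudaStreamSynchronize(stream);
    reportFree("after 1st launch:");   // drop ~= memory reserved for the launch

    reportFree("before 2nd launch:");
    k<<<4, 512, 0, stream>>>();
    printf("2nd launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(stream);
    return 0;
}
```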

If you’d like to see a change to CUDA behavior (e.g. a more detailed error), you’re welcome to file a bug at developer.nvidia.com.