too many resources requested for launch

I’m using these flags: -arch=compute_20 -code=sm_20. Are they equivalent to the one you mentioned?

I think that should be just as good as what I posted.

This would be a situation where it would be really nice if NVIDIA could make that error message more specific about exactly which resource is insufficient…

Is there a resource other than shared memory and # of registers that could be the culprit?

By the way, the kernel has a lot of 32-bit arguments (15 to be exact). Could that be eating up shared memory?

Compute capability 2.0 passes kernel arguments via constant memory, so that should not be a problem here.

Besides register count / thread count / shared memory / constant memory, there’s not much that would raise this error.


When using the Driver API, this is a pretty typical error message in the 3.2 SDK when you set cuParamSetSize to anything larger than what your kernel actually requires (e.g. if your function is void(int) and you pass in anything > sizeof(int), you get this error). That never happened prior to 3.2 (I can only assume someone added this sanity checking, but was too lazy to add a meaningful error message).
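For anyone hitting it that way, here is a minimal sketch of the situation being described, using the old driver-API launch calls from that era; the module file, kernel name, and sizes are made up for illustration:

    #include <cuda.h>
    #include <stdio.h>

    int main(void)
    {
        CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction func;
        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);
        cuModuleLoad(&mod, "kernel.cubin");          // placeholder module
        cuModuleGetFunction(&func, mod, "scale");    // placeholder kernel: __global__ void scale(int n)

        int n = 1024;
        cuFuncSetBlockShape(func, 256, 1, 1);
        cuParamSeti(func, 0, n);                     // one int argument at offset 0
        cuParamSetSize(func, sizeof(int));           // must match the kernel's real parameter size;
                                                     // passing e.g. 2 * sizeof(int) here on CUDA 3.2
                                                     // produces "too many resources requested for launch"
        CUresult res = cuLaunchGrid(func, n / 256, 1);
        printf("launch result: %d\n", res);
        return 0;
    }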

Completely unhelpful error message, I know…

Anyway, my understanding is that with the runtime API you don’t have to specify the parameter size yourself (it’s done automatically). But if you stepped into the disassembly and figured out how large a size it was computing, I wouldn’t be surprised to find a bug in the runtime API that triggers this for some types in the formal parameter list…

Hope the following qualifies as “any suggestion”: in such situations I attempt to reduce each potentially limiting resource, e.g. dynamically allocated shared memory, or the thread count, or constant memory, one by one, until I get the kernel to launch. The kernel might crash, of course, e.g. due to running out of the shared memory space, but that’s different from not launching. This way I can often get some clue what’s going on and what to blame nVidia for (too little shared memory vs too few registers, etc).
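Purely as an illustration, a rough runtime-API sketch of that probing; the kernel and its shared-memory sizing are stand-ins for the real ones:

    #include <cstdio>

    // Placeholder kernel, standing in for the one that fails to launch.
    __global__ void myKernel(float *d, int n)
    {
        extern __shared__ float buf[];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[threadIdx.x] = d[i];
    }

    // Probe progressively smaller block sizes until the launch succeeds.
    void probe(float *d_data, int n)
    {
        for (int threads = 512; threads >= 32; threads /= 2) {
            size_t smem = threads * sizeof(float);           // dynamic shared memory per block
            int blocks  = (n + threads - 1) / threads;
            myKernel<<<blocks, threads, smem>>>(d_data, n);
            cudaError_t err = cudaGetLastError();            // launch-configuration failures show up here
            if (err == cudaSuccess) {
                printf("launched with %d threads per block\n", threads);
                return;
            }
            printf("%d threads per block: %s\n", threads, cudaGetErrorString(err));
        }
    }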

Also, have you added --ptxas-options=-v to the nvcc command line for some verbosity in resource usage?

In addition to that, if you know in advance how many threads you intend to run per SM, you can pass the --maxrregcount=XX argument to nvcc, where XX is the number of registers each thread of your kernel may use. This way you can avoid spilling into local memory, which is a show-stopper for many kernels.
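For reference, combined with the flags mentioned earlier in the thread, the compile line might look something like this (the register limit of 32 and the file names are just placeholders):

    nvcc -arch=compute_20 -code=sm_20 --ptxas-options=-v --maxrregcount=32 -o app kernel.cu

With -v, ptxas reports each kernel’s register, shared, constant and (if any) local memory usage at compile time, which you can plug into the occupancy math.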

Hi folks, we think this is a mistake in the Occupancy Calculator’s calculation. We’ll have to get an updated version out in the next SDK release. Meanwhile, I suggest you use __launch_bounds__() to specify the maximum block size you will use with the kernel, to ensure it will launch. In fact, I highly recommend you use it whenever possible!
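For example, something along these lines (the 256-thread cap and the minimum of 4 resident blocks per SM are purely illustrative values):

    // Cap this kernel at 256 threads per block and ask for at least 4 resident
    // blocks per SM; the compiler then limits per-thread register usage so that
    // a 256-thread block can always be launched.
    __global__ void
    __launch_bounds__(256, 4)
    scaleKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;
    }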

Mark

Following cudesnick’s suggestions, I reduced shared memory and register usage (for the latter, using the nvcc --maxrregcount=XX option) until the kernel was able to launch. The limiting factor turned out to be register usage.

Could this be more than an error in the Occupancy Calculator? I dynamically calculate my block size for each kernel instance, using the formulas in section 4.2 of the Programmer’s Guide for the total number of registers and shared memory allocated for a block. I compare these values to the limits returned by cudaGetDeviceProperties() to determine if the block size is feasible. For the kernel that wouldn’t launch, my values and those of the Occupancy Calculator agreed and indicated that the kernel should launch, but it didn’t.
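For what it’s worth, a stripped-down sketch of that kind of check follows; the per-thread register count and per-block shared memory are inputs you would take from ptxas -v, and the allocation-granularity rounding used by the Programming Guide’s formulas is omitted here:

    #include <cuda_runtime.h>
    #include <stdbool.h>

    // Rough feasibility check of a candidate block size against the device limits.
    // regsPerThread and smemPerBlock are placeholders (e.g. from ptxas -v output);
    // the real formulas also round these up to the hardware allocation granularity.
    bool blockFits(int threadsPerBlock, int regsPerThread, size_t smemPerBlock)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                 // device 0 for illustration

        return threadsPerBlock                 <= prop.maxThreadsPerBlock
            && threadsPerBlock * regsPerThread <= prop.regsPerBlock
            && smemPerBlock                    <= prop.sharedMemPerBlock;
    }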

I’m not convinced it’s generally the right thing to do, although in your particular case it might be. It’s quite possible that you don’t want to run as many threads on an SM as your architecture allows, for performance reasons. I think it’s better to experiment with different thread counts and choose the optimal number by trial and error. See, for example

http://www.eecs.berkeley.edu/~volkov/volkov10-GTC.pdf