Kernel execution failed: Too many resources requested for launch

Hello people,

I’m sorry if this is a repeat question; I searched but found nothing relevant. In what situations can a CUDA kernel launch fail with the following error?

“Kernel execution failed: too many resources requested for launch”

I have a kernel that works correctly when I place a large data set in device memory. When I instead supply the same data through textures using CUDA arrays (textures are limited to 65536 in the x dimension, so I wrap the bytes around), I get the above failure.
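The wrapping mentioned above could look like the following on the host side: a linear element index is split into a column and a row so that no row exceeds the 65536 width limit. This is only a sketch of one plausible layout, not the poster's actual code; the names `kMaxTexWidth`, `TexCoord`, `wrapIndex`, and `rowsNeeded` are hypothetical.

```cpp
#include <cstddef>

// Width limit quoted in the post for the x dimension of a texture.
const std::size_t kMaxTexWidth = 65536;

// Hypothetical 2D coordinate of one element inside the wrapped array.
struct TexCoord {
    std::size_t x;  // column, always < kMaxTexWidth
    std::size_t y;  // row
};

// Map a linear element index to (x, y) by wrapping at the width limit.
inline TexCoord wrapIndex(std::size_t linearIndex) {
    TexCoord c;
    c.x = linearIndex % kMaxTexWidth;
    c.y = linearIndex / kMaxTexWidth;
    return c;
}

// Number of rows needed to hold `count` elements (rounded up).
inline std::size_t rowsNeeded(std::size_t count) {
    return (count + kMaxTexWidth - 1) / kMaxTexWidth;
}
```

For instance, `wrapIndex(65536)` gives column 0 of row 1, and 100000 elements need two rows. The kernel would then have to apply the same mapping when it fetches from the texture.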


I’ve only seen that message when requesting a block dimension with more threads than can run with the given register usage. There are 8192 registers per multiprocessor, so the largest block you can run is 8192 divided by the number of registers each thread uses (from the cubin), or 512, the device maximum, whichever is smaller.

It might also show up if you request more shared memory than is available (16 KB).
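As a rough illustration of those limits, the arithmetic can be written as a small host-side check. The constants are the figures quoted in this thread; the function names `maxThreadsForRegisters` and `launchFits` are hypothetical. The estimate is deliberately simple: the real hardware may round register allocations up, so a launch can still fail even when this check passes.

```cpp
#include <algorithm>

// Limits quoted in this thread for the hardware under discussion.
const int kRegistersPerMultiprocessor = 8192;
const int kMaxThreadsPerBlock = 512;
const int kSharedBytesAvailable = 16 * 1024;

// Largest block size allowed by register use alone, capped at the device maximum.
inline int maxThreadsForRegisters(int regsPerThread) {
    if (regsPerThread <= 0) return kMaxThreadsPerBlock;
    return std::min(kMaxThreadsPerBlock,
                    kRegistersPerMultiprocessor / regsPerThread);
}

// Rough check (an estimate only): does a launch stay within the limits above?
inline bool launchFits(int regsPerThread, int threadsPerBlock, int sharedBytes) {
    return threadsPerBlock <= maxThreadsForRegisters(regsPerThread)
        && sharedBytes <= kSharedBytesAvailable;
}
```

With 32 registers per thread, for example, the estimate allows at most 256 threads per block, so `launchFits(32, 512, 0)` is false while `launchFits(32, 256, 0)` is true.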

Thanks! It turns out I am exceeding the 8192-register limit.

I too am now getting the “too many resources requested for launch” error. It occurs in a kernel that worked before I installed CUDA 1.1. The kernel uses 17 registers per thread and 64 threads per block (1088 registers in total), so it is well under the 8192 limit. Shared memory use is 8336 bytes. I have not changed any of the code around the kernel call, so in theory it is using the default stream 0, and I have also tried running the kernel with the stream set explicitly to 0. The guide says that this should cause kernel launches and memory copies to wait for all preceding operations, but is it possible that the (similar) shared memory use of a preceding or following kernel could be causing this error?
Any help is appreciated.

To be exact: 512 threads maximum per block and 768 threads per multiprocessor.

This seems odd. Are you requesting any “dynamic” shared memory in the kernel invocation? Perhaps its size is being set from an uninitialized variable or something similar.

No, the shared memory has a constant size and is allocated in the kernel. I’ve tried running the kernel by itself, and also without shared memory (the data goes to local memory instead), and I’m getting the same result. I’m going to keep fiddling; maybe I can isolate the cause.

Okay, the kernel has a line:
nCands[candIdx] = Min(Min(tempCands[0], tempCands[1]), Min(tempCands[2], tempCands[3]));


__device__ float3 Min(float3 A, float3 B)
{
    // Return whichever candidate has the smaller z component.
    if (A.z < B.z)
        return A;
    return B;
}

The elements of nCands and tempCands are float3. The kernel still fails when I change this to

float3 tempF = Min(Min(tempCands[0], tempCands[1]), Min(tempCands[2], tempCands[3]));

nCands[candIdx].x = tempF.x;
nCands[candIdx].y = tempF.y;
nCands[candIdx].z = tempF.z;

However, if I comment out either the .x assignment or the .y assignment, the kernel runs without a problem. It also works if nCands[candIdx] is assigned constants using make_float3. I can also assign tempF.x and tempF.y to float variables.

This kernel did actually work under CUDA 1.0.

Edit: Definitely a register problem. I think I was sidetracked by the fact that the kernel had once been working. When I got my register use down by one (actually, by using separate float arrays instead of one float3 array), my kernel started working again.
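The layout change described in the edit, separate float arrays in place of one float3 array, can be sketched in plain host-side C++ as below. `Float3` stands in for CUDA's `float3`, and `minByZ`, `CandsSoA`, and `storeCandidate` are hypothetical names. The post does not say which array was split, so applying it to the output array is an assumption. Whether this lowers register use depends on the compiler version, so treat it as an illustration of the layout only.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for CUDA's float3 so the sketch builds as plain C++.
struct Float3 {
    float x, y, z;
};

// Same selection rule as Min in the post: keep the candidate with the smaller z.
inline Float3 minByZ(const Float3& a, const Float3& b) {
    return (a.z < b.z) ? a : b;
}

// Hypothetical structure-of-arrays replacement for a float3 output array.
struct CandsSoA {
    std::vector<float> x, y, z;
    explicit CandsSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// Reduce four candidates and write the winner into the separate arrays.
inline void storeCandidate(CandsSoA& out, std::size_t idx,
                           const Float3 (&c)[4]) {
    const Float3 best = minByZ(minByZ(c[0], c[1]), minByZ(c[2], c[3]));
    out.x[idx] = best.x;
    out.y[idx] = best.y;
    out.z[idx] = best.z;
}
```

Each component is now written to its own array with a scalar store, rather than as one three-component struct.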

Are you a registered developer? If you are, can you please file a bug through the registered developer site and attach the original kernel in the form where it worked on CUDA 1.0 but failed on CUDA 1.1?

We want to catch this sort of register allocation regression and fix it if possible.