Another "too many resources requested" issue

Hi all,

The title already gives it away: one of my kernels reports that I requested too many resources. I did some research and learned that this is due to either shared memory or registers. The kernel is called with a dynamic amount of shared memory, so normally I won’t get any information on this when inspecting the verbose output of ptxas (-Xptxas -v). However, for my test case I know what the amount will be (418 * sizeof(uint)), so I just hardcoded this as a static shared memory pool instead. The output is as follows:

ptxas info : Used 17 registers, 6688 bytes smem, 44 bytes cmem[0]

The number of threads per block is 418, so the total number of registers per block is 17 * 418 = 7106, which is well within the bounds of my Tesla M2075 (cc 2.0). The amount of shared memory is not very impressive either, is it? Why then does this kernel report that I requested too many resources?
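The arithmetic above can be written down as a small host-side check. This is only a sketch: the cc 2.0 limits (32768 registers, 49152 bytes of shared memory, 1024 threads per block) are assumed from the CUDA programming guide rather than queried from the device, and all names are made up for illustration. Note that 418 * sizeof(uint) alone is 1672 bytes with a 4-byte uint; the post does not say what accounts for the rest of the 6688 bytes that ptxas reports.

```cpp
#include <cstddef>

// Figures taken from the ptxas output and the launch configuration in the post.
constexpr int kRegsPerThread = 17;
constexpr int kThreadsPerBlock = 418;
constexpr std::size_t kSmemBytesPerBlock = 6688;

// Per-block limits for compute capability 2.0 (assumed, not queried at runtime).
constexpr int kMaxRegsPerBlock = 32768;
constexpr int kMaxThreadsPerBlock = 1024;
constexpr std::size_t kMaxSmemPerBlock = 49152;

// Naive register total: ignores any allocation granularity in the hardware.
constexpr int naive_regs_per_block() {
    return kRegsPerThread * kThreadsPerBlock;
}

// True when all three per-block figures stay within the assumed limits.
constexpr bool fits_cc20() {
    return naive_regs_per_block() <= kMaxRegsPerBlock
        && kThreadsPerBlock <= kMaxThreadsPerBlock
        && kSmemBytesPerBlock <= kMaxSmemPerBlock;
}

static_assert(naive_regs_per_block() == 7106, "17 registers * 418 threads");
static_assert(fits_cc20(), "posted figures fit the assumed cc 2.0 limits");
```

Both static assertions hold, which matches the puzzle in the post: the numbers ptxas reports do not explain the error by themselves.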

Any ideas would be welcome :-)


What is the total amount of shared memory per block for the Tesla M2075?
As far as I know your 6688 bytes are per thread. 418*6688=2795584 bytes of shared memory per block.
On my GTX 660 this would be well above the GPU's limit of 49152 bytes. But as mentioned above, I don't know the specs of the M2075.

edit: What happens if you decrease the number of threads/block, say to 128?

Thanks for your reply! The M2075, like your GTX 660, has 49152 bytes of shared memory available per block. But you say the 6688 bytes are per thread? Why do you think that is the case? It’s supposed to be the shared memory, which is shared per block, right?

The algorithm in the kernel highly depends on the number of blocks/threads being as it is now, but even if I lower this (and make the algorithm fail), I get the same error. This is not very surprising, as I’m quite sure that the shared memory limit has not been exceeded…

Any other thoughts maybe? I must be overlooking something trivial as always…

Sorry, smem is the shared memory per block, of course…

Thanks for your effort anyway :-)

I still didn’t solve it though! Anyone?

What are your compile arguments? Are you using the flag -arch=sm_20? If not, it will generate code for compute capability 1.0, which could explain it.

Nope! Compiling with -arch=sm_20.


Are you passing a lot of parameters, or passing parameters that are large structures (rather than pointers to arrays/structures in GPU RAM)?
The other thing I would suspect is some calculated array reference that falls outside the 418*sizeof(uint).

Also (and I don’t think it is your problem, but just mentioning it for future reference):
Registers used to be allocated in chunks of 16 per thread. If that still holds, you are effectively using 32 registers per thread, not 17, and partial warps still count as 32 threads. From the graph I am reading, this would mean you are using 16384 registers in total. (“Demystifying GPU Microarchitecture through Microbenchmarking”, but it is dated 3 years ago and lots has changed since then.)
And I think 64-bit floats (“doubles”) take more register space than 32-bit floats.
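The rounding described above can be sketched as follows. The granularities are the assumptions from this post (registers rounded up to a multiple of 16 per thread, threads rounded up to whole warps of 32), not confirmed behaviour of cc 2.0 hardware, and the function names are hypothetical.

```cpp
// Round value up to the next multiple of `multiple` (multiple > 0).
constexpr int round_up(int value, int multiple) {
    return ((value + multiple - 1) / multiple) * multiple;
}

// Assumed granularities from the post above; real hardware may differ.
constexpr int kRegChunk = 16;   // registers allocated in chunks of 16 per thread
constexpr int kWarpSize = 32;   // partial warps count as full warps

// Effective register count for one block under those assumptions.
constexpr int padded_regs_per_block(int regs_per_thread, int threads_per_block) {
    return round_up(regs_per_thread, kRegChunk)
         * round_up(threads_per_block, kWarpSize);
}

static_assert(round_up(17, kRegChunk) == 32, "17 registers pad to 32");
static_assert(round_up(418, kWarpSize) == 448, "418 threads pad to 14 warps");
static_assert(padded_regs_per_block(17, 418) == 14336, "32 * 448");
```

Under these assumptions the block would occupy 14336 registers, about twice the naive 7106 and in the same range as the 16384 read off the graph, but still below the 32768 registers of a cc 2.0 multiprocessor.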

He is using 17 registers with 418 threads per block and 6688 bytes of shared memory on cc 2.0. All of these are within the limits even for cc 1.0. I think you might have an error in the main program and are accidentally asking for more shared memory or more threads than are allowed.
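One way to catch such a mistake is to validate the launch parameters on the host just before the kernel call. The sketch below is plain C++ with hypothetical names (LaunchConfig, DeviceLimits, first_violation); the limits are hardcoded assumptions for cc 2.0 and cc 1.0, whereas a real program would more likely read them from cudaGetDeviceProperties.

```cpp
#include <cstddef>

// Values the host code is about to pass to the kernel launch (hypothetical struct).
struct LaunchConfig {
    unsigned threads_per_block;
    std::size_t static_smem_bytes;   // as reported by ptxas
    std::size_t dynamic_smem_bytes;  // third launch parameter in <<<...>>>
    unsigned regs_per_thread;        // as reported by ptxas
};

// Per-block limits of the target device (hypothetical struct).
struct DeviceLimits {
    unsigned max_threads_per_block;
    std::size_t max_smem_per_block;
    unsigned max_regs_per_block;
};

// Assumed limits; check them against the documentation for the actual device.
inline DeviceLimits limits_cc20() { return DeviceLimits{1024, 49152, 32768}; }
inline DeviceLimits limits_cc10() { return DeviceLimits{512, 16384, 8192}; }

// Returns the name of the first exceeded resource, or nullptr if the config fits.
inline const char* first_violation(const LaunchConfig& c, const DeviceLimits& d) {
    if (c.threads_per_block == 0 || c.threads_per_block > d.max_threads_per_block)
        return "threads per block";
    if (c.static_smem_bytes + c.dynamic_smem_bytes > d.max_smem_per_block)
        return "shared memory per block";
    const unsigned long long regs =
        static_cast<unsigned long long>(c.regs_per_thread) * c.threads_per_block;
    if (regs > d.max_regs_per_block)
        return "registers per block";
    return nullptr;
}
```

With the posted figures (418 threads, 6688 bytes, 17 registers) the check passes for both sets of limits. It would flag, for example, a dynamic shared memory size that was accidentally computed as 418 * 418 * sizeof(unsigned), which is 698896 bytes.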