I have recently been posting on the general CUDA forum when I think I should have been posting on this one. This post relates to the emulation vs. GPU execution outputs disagreeing.

Having added the following line to my code after the call to the kernel

printf("\n\n%s\n", cudaGetErrorString(cudaGetLastError()));

the following is output

“too many resources requested for launch”

Rispek’ to E.D. Riedijk for nailing that! :thumbup:

Appendix A.1.1 lists restrictions on memory, threads, registers, etc.

I have calculated the amount of memory I am using (all in global at the moment but that will change as soon as I get this problem sorted out) and I am well within the max for global memory.

Am I using too many register variables? I have just calculated that I have 74 register variables in total over one kernel and nine device functions, one of which has 25 register variables, per thread. But according to A.1.1 there are 8192 registers in total. If the largest function with 25 register variables were executing concurrently on 128 threads then this would require 9472 registers. However I do not get an insufficient resource error when I execute on 128 threads, i.e. 4 warps.

So ???

Does CUDA report or give a clue as to which resource or resource type is insufficient?

How does CUDA determine that too many resources have been requested for the launch, and if it knows this, why does it not report which resources are causing the error?

The numbers are for a kernel, not per device function. In order to see how many registers your kernel needs you should look at the cubin output. Your problem is most likely that your kernel uses more than 64 registers per thread, hence the “too many resources requested”. This can also happen if you request too much shared memory, btw.

correction : 128*25 = 3200, not 9472 :">

This is my kernel


__global__ void SPH(float* x, float* vx, float* rho, float* p, float* mass, float* hsml, float* du, float* tdsdt, float* indvxdt, float* c, float* ardvxdt, float* avdudt, float* v_min, float* u_min, float* dx, float* dvx, float* av, float* ds, float* t, float* dedt, float* dvxdt, int* itype, int* dfindexstart, int* dfindexstop, float* w, float* dwdx, int* pair_i, int* pair_j, float* u)
{
	int maxtimestep;
	float dt;

	dt = 0.005f;
	maxtimestep = MAXITS; // should be 200 = 1 second

	...
}

time_integrationd calls 7 other device functions.

I count only 2 register variables; maxtimestep and dt.

As stated the error is reported for >=193 threads.

The functions with register variables are


inputd 1

time_integration 3

single_step 2

direct_find 11

kerneld 3

p_gas 1

sum_density 12

int_forced 14

art_visc 25

The sequence of function calls is

	for(i=0 ; i<maxtimestep ; i++) single_step();
Adding “--ptxas-options=-v” to your compiler arguments should make the compiler print the number of registers needed per thread. Multiply that by your block size. As the whole block must fit into one MP, that number has to be <= 8192.

Just tried that and got

ptxas info : compiling entry function SPH …

ptxas info : used 39 registers, 249+240 bytes smem, 72 bytes cmem[1]

I also tried the -cubin option expecting to find a .cubin file in the working dir but could not find one.

I currently have one block of N threads, where N = 32i, i=1,2,3,… i.e. multiples of warps.

8192/39 = maximum of 210 threads

This would explain why the resource error occurs for 7 warps (224 threads), but not why it also occurs for 6 warps + 1 thread = 193 threads. Unless, because 193 threads require 7 warps (the whole 7th warp executes even though only 1 of its threads does useful work), effectively 7 warps' worth of registers are required.

Is that it?

So why are 39 registers required per thread? I don't understand how that number is arrived at. Can anyone explain?

The kernel is passed 29 variables and passes all of those down to time_integration, plus two more variables = 31.


As far as I understood the architecture, that is it. The hardware is 8-way SIMD and can therefore only execute full warps. There is no finer granularity in thread creation. That's why your code should also always boundary check.

The kernel calls 2 functions;

inputd requires 8 variables

time_integration requires 31 variables

31 + 8 = 39, even though the 8 for inputd are also included in the 31 for time_integration.

How does declaring the variables as __device__ affect this?

Working out register usage by counting variables is not a good idea. Firstly, I think the input variables to the kernel are in shared memory, and secondly, that is just not how it works. For example, registers are used for intermediate calculations where no variable is defined.

While registers are related to your variables they are not identical.

OK, I’m just guessing. It is a bit of a coincidence though.

I'm just a bit concerned about how the number of register variables per thread is calculated. I don't know. So how am I to amend my code without any clues as to which variables are required for the kernel launch?

You could always use the occupancy calculator in order to find out how many threads you can actually run per block…

Yes, I understand that. But that requires a lot of trial and error in amending the source code. I was hoping for a more scientific way of knowing how the registers-per-thread value is calculated and which variables the registers are used for.

Is that info not reported anywhere? Surely if the -cubin option can calculate it then the compiler knows which variables are using those registers?

BTW I tried compiling with the -cubin option and could find nothing being reported. What needs to have been installed for the -cubin option to work?

Has someone told you about -maxrregcount=N yet?

The number of registers per thread is not something you can calculate by hand from the source. Here are the steps to convert your source code to something that will run on the device:

  1. compile your .cu source code to a .ptx file. Here each new variable gets a new register, so you see a lot of registers in there (if you add -keep, the .ptx will not be deleted)
  2. convert the .ptx with ptxas into machine code (.cubin). Here aggressive register optimization is performed. After all that is done, it is known how many registers are needed per thread, and that is what is reported.

It is advisable to always use --ptxas-options=-v and fill the reported values into the occupancy calculator to see how many threads you can request per block. That beats the trial-and-error approach ;)