TOO MANY RESOURCES REQUESTED FOR LAUNCH

chrismc · September 2, 2008, 8:16am

I have recently been posting on the general CUDA forum when I think I should have been posting on this one. This post relates to the emulation v GPU execution outputs disagreeing.

Having added the following line to my code after the call to the kernel

printf("\n\n%s\n", cudaGetErrorString(cudaGetLastError()));

the following is output

“too many resources requested for launch”

Rispek’ to E.D. Riedijk for nailing that!

Appendix A.1.1 lists restrictions on memory, threads, registers etc

I have calculated the amount of memory I am using (all in global at the moment but that will change as soon as I get this problem sorted out) and I am well within the max for global memory.

Am I using too many register variables? I have just counted 74 register variables per thread in total, across one kernel and nine device functions, one of which has 25. But according to A.1.1 there are 8192 registers in total. If the largest function, with 25 register variables, were executing concurrently on 128 threads, this would require 9472 registers. However, I do not get an insufficient-resource error when I execute on 128 threads, i.e. 4 warps.

So ???

Does CUDA report or give a clue as to which resource or resource type is insufficient?

chrismc · September 2, 2008, 8:29am

How does CUDA determine that too many resources have been requested for launch? And if it can work that out, why does it not report which resources are causing the error?

Linny · September 2, 2008, 8:30am

The numbers are for a kernel, not per device function. To see how many registers your kernel needs, you should look at the cubin output. Your problem is most likely that each kernel uses more than 64 registers per thread, hence the “too many resources requested”. This can also happen if you request too much shared memory, by the way.

chrismc · September 2, 2008, 8:51am

Correction: 128*25 = 3200, not 9472.

chrismc · September 2, 2008, 9:10am

This is my kernel

========================================

__global__ void SPH(float* x, float* vx, float* rho, float* p, float* mass, float* hsml,
                    float* du, float* tdsdt, float* indvxdt, float* c, float* ardvxdt,
                    float* avdudt, float* v_min, float* u_min, float* dx, float* dvx,
                    float* av, float* ds, float* t, float* dedt, float* dvxdt,
                    int* itype, int* dfindexstart, int* dfindexstop, float* w, float* dwdx,
                    int* pair_i, int* pair_j, float* u)
{
    int   maxtimestep;
    float dt;

    dt = 0.005;
    maxtimestep = MAXITS; // should be 200 = 1 second

    inputd(mass, hsml, itype, vx, x, u, p, rho);
    // Wait until every thread in the block has finished inputd.
    __syncthreads();

    time_integrationd(x, vx, rho, p, mass, hsml, du, tdsdt, indvxdt, c, ardvxdt, avdudt,
                      v_min, u_min, dx, dvx, av, ds, t, dedt, dvxdt, itype,
                      dfindexstart, dfindexstop, w, dwdx, pair_i, pair_j,
                      dt, maxtimestep, u);
}

=================================

time_integrationd calls 7 other device functions.

I count only 2 register variables; maxtimestep and dt.

As stated the error is reported for >=193 threads.

The functions with register variables are:

SPH               2
inputd            1
time_integration  3
single_step       2
direct_find      11
kerneld           3
p_gas             1
sum_density      12
int_forced       14
art_visc         25

The sequence of function calls is:

time_integration()
{
    for (i = 0; i < maxtimestep; i++) single_step();
}

single_step()
{
    direct_find
    __syncthreads
    sum_density
    __syncthreads
    int_forced
    __syncthreads
    art_visc
    __syncthreads
}

theMarix · September 2, 2008, 9:21am

Adding --ptxas-options="-v" to your compiler arguments should make the compiler print the number of registers needed per thread. Multiply that by your block size. As the whole block must fit onto one multiprocessor (MP), that product has to be <= 8192.
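The check described above can be written down as a few lines of host-side C. This is only a sketch of the arithmetic: the 8192 figure is the register count quoted in this thread, and the function names `block_registers` and `block_fits` are made up for illustration, not part of any CUDA API.

```c
#include <stdbool.h>

/* Registers available on one multiprocessor, as quoted in this thread. */
#define REGS_PER_MP 8192

/* Registers one block needs if every thread uses regs_per_thread registers. */
static int block_registers(int regs_per_thread, int block_size)
{
    return regs_per_thread * block_size;
}

/* True if the whole block fits into one multiprocessor's register file. */
static bool block_fits(int regs_per_thread, int block_size)
{
    return block_registers(regs_per_thread, block_size) <= REGS_PER_MP;
}
```

For example, `block_fits(25, 128)` is true, since 25 * 128 = 3200 registers is below the limit. The per-thread figure has to be the one the compiler reports, not a count of variables in the source.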

chrismc · September 2, 2008, 10:04am

Just tried that and got

ptxas info : compiling entry function SPH …

ptxas info : used 39 registers, 249+240 bytes smem, 72 bytes cmem[1]

I also tried the -cubin option expecting to find a .cubin file in the working dir but could not find one.

chrismc · September 2, 2008, 10:12am

I currently have one block of N threads, where N = 32i, i=1,2,3,… i.e. multiples of warps.

8192/39 gives a maximum of 210 threads.

This would explain why the resource error occurs for 7 warps (224 threads), but not why it occurs for 6 warps + 1 thread = 193 threads. Unless, with 193 threads, I need 7 warps: only one thread of the 7th warp does any work, but the whole warp is launched, so effectively 7 warps' worth of registers are required.

Is that it?

chrismc · September 2, 2008, 10:20am

So why are 39 registers required per thread? I don’t understand how that is calculated. Can anyone explain how that is calculated?

The kernel is passed 29 variables and passes all of them down to time_integration, plus two more, giving 31.


theMarix · September 2, 2008, 10:23am

As far as I understand the architecture, that is it. The hardware is 8-way SIMD and therefore can only execute full warps. There is no finer granularity in thread creation. That’s why your code should also always do a bounds check.
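The warp rounding can be sketched in host-side C as follows. It is a simplified model, assuming registers are reserved for whole warps of 32 threads and that 8192 registers are available, as quoted in this thread. The real allocation granularity on a given device may be coarser, and the function names are hypothetical.

```c
#include <stdbool.h>

#define WARP_SIZE   32
#define REGS_PER_MP 8192   /* figure quoted in this thread */

/* Number of warps needed to hold `threads` threads (rounded up). */
static int warps_needed(int threads)
{
    return (threads + WARP_SIZE - 1) / WARP_SIZE;
}

/* Registers reserved when every launched warp is allocated in full. */
static int registers_allocated(int regs_per_thread, int threads)
{
    return warps_needed(threads) * WARP_SIZE * regs_per_thread;
}

/* True if the rounded-up block fits into the register file. */
static bool launch_fits(int regs_per_thread, int threads)
{
    return registers_allocated(regs_per_thread, threads) <= REGS_PER_MP;
}
```

With 39 registers per thread, 192 threads occupy 6 warps and need 7488 registers, which fits. 193 threads occupy 7 warps and need 8736 registers, which does not, matching the observation that the error starts at 193 threads.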

chrismc · September 2, 2008, 10:43am

The kernel calls 2 functions;

inputd requires 8 variables

time_integration requires 31 variables

31 + 8 = 39, even though the 8 for inputd are also included in the 31 for time_integration.

How does declaring the variables as __device__ affect this?

Tigga · September 2, 2008, 10:52am

Working out register usage by counting variables is not a good idea. Firstly, I think the input variables to the kernel are in shared memory, and secondly, it’s just not how it works. For example, registers are also used for intermediate results that do not correspond to any variable you declared.

While registers are related to your variables they are not identical.

chrismc · September 2, 2008, 11:11am

OK, I’m just guessing. It is a bit of a coincidence though.

I’m just a bit concerned about how the number of register variables per thread is calculated. I don’t know. So how am I to amend my code without any clues as to which variables are required for the kernel launch?

Linny · September 2, 2008, 11:15am

You could always use the occupancy calculator in order to find out how many threads you can actually run per block…

chrismc · September 2, 2008, 11:34am

Yes, I understand that. But that requires a lot of trial-and-error work in amending the source code. I was hoping for a more scientific way of knowing how the registers-per-thread value is calculated and which variables the registers are used for.

Is that info not reported anywhere? Surely if the -cubin option can calculate it then the compiler knows which variables are using those registers?

BTW I tried compiling with the -cubin option and could find nothing being reported. What needs to have been installed for the -cubin option to work?

alex_dubinsky · September 2, 2008, 7:21pm

Has someone told you about -maxrregcount=N yet?

E.D_Riedijk · September 2, 2008, 7:34pm

The amount of registers per thread is not calculated. Here are the steps to convert your source code to something that will run on the device:

1. Compile your .cu source code to a .ptx file. Here each new variable gets a new register, so you see a lot of registers in there (if you add -keep, the .ptx will not be deleted).
2. Convert the .ptx with ptxas into machine code (.cubin). Here aggressive register optimization is performed. After all that is done, it is known how many registers are needed per thread, and that is what is reported.

It is advisable to always use --ptxas-options=-v and enter the reported values in the occupancy calculator to see how many threads you can request per block. That beats the trial-and-error approach ;)
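As a rough stand-in for that step, the reported values can be fed into a small host-side C helper. This is not the occupancy calculator, only a sketch of the per-block limit. The 8192 registers come from this thread; the 16384 bytes of shared memory and the 512-thread cap are assumptions about this hardware generation, and `max_block_size` is a hypothetical name.

```c
#define WARP_SIZE             32
#define REGS_PER_MP           8192   /* figure quoted in this thread */
#define SMEM_PER_MP           16384  /* assumption: 16 KB shared memory per MP */
#define MAX_THREADS_PER_BLOCK 512    /* assumption: per-block thread limit */

/*
 * Largest block size (a multiple of the warp size) that fits, given the
 * registers per thread and shared memory bytes reported by ptxas.
 * Returns 0 if no block can be launched at all.
 */
static int max_block_size(int regs_per_thread, int smem_bytes)
{
    if (regs_per_thread <= 0 || smem_bytes > SMEM_PER_MP)
        return 0;

    int limit = REGS_PER_MP / regs_per_thread;   /* register-bound thread count */
    if (limit > MAX_THREADS_PER_BLOCK)
        limit = MAX_THREADS_PER_BLOCK;

    return (limit / WARP_SIZE) * WARP_SIZE;      /* round down to whole warps */
}
```

For the output reported earlier (39 registers, 249+240 bytes of shared memory), `max_block_size(39, 249 + 240)` gives 192 threads per block under these assumptions.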
