Too many resources requested for launch

Hi everyone,

While debugging a kernel, I received the following error:

“too many resources requested for launch”

At first I thought that my kernel exceeded the 256-byte argument limit, so I used module device data to work around it. However, the same error persists. From the NVIDIA forums I understand that it might mean I am requesting too many registers or too much shared memory, but I do not know how to get around this issue. I am also using the -Mcuda=ptxinfo flag to find out more, but it does not give me any information.

The code I used to debug the kernel comes from the following forum post:

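! Wait for the kernel to finish, then pick up any launch or runtime error
istat = cudaThreadSynchronize()
errCode = cudaGetLastError()
if (errCode .ne. cudaSuccess) then
       ! Print the numeric code together with the runtime's description of it
       print *, errCode, cudaGetErrorString(errCode)
       stop 'Error! Kernel failed!'
endif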

If possible, could anyone tell me how to further investigate and fix this issue? Thank you for your time.

-Chris

Hi Chris,

I haven’t hit this problem myself, so I don’t know for sure. However, it seems correct that you’re running out of registers, shared memory, or possibly constant memory.
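To see what the limits actually are on your card, something along these lines should work. This is only a minimal sketch that I have not run on your system: the program name query_limits is made up for illustration, and it assumes the GPU of interest is device 0 and that the file is compiled as CUDA Fortran (for example a .cuf file, or with -Mcuda).

```fortran
! Sketch: print the per-block resource limits reported for device 0.
! Assumes CUDA Fortran and a CUDA-capable GPU; query_limits is a
! hypothetical name used only for this example.
program query_limits
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat

  ! Fill prop with the properties of device 0 (assumed to be the one in use)
  istat = cudaGetDeviceProperties(prop, 0)
  if (istat .ne. cudaSuccess) then
     print *, 'cudaGetDeviceProperties failed: ', cudaGetErrorString(istat)
     stop
  end if

  print *, 'Compute capability     : ', prop%major, prop%minor
  print *, 'Registers per block    : ', prop%regsPerBlock
  print *, 'Shared memory per block: ', prop%sharedMemPerBlock
  print *, 'Constant memory (bytes): ', prop%totalConstMem
  print *, 'Max threads per block  : ', prop%maxThreadsPerBlock
end program query_limits
```

Comparing these numbers with what ptxinfo reports for the kernel, multiplied by the threads per block in the case of registers, should show which resource is being exhausted.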

If the problem is with registers, you can try using the flag “-Mcuda=maxregcount:n” where “n” is the maximum number of registers to use.

“I am also using the -Mcuda=ptxinfo flag to find out more but it does not give me any information.”

Just to verify, are you compiling code that includes global kernels? It should give you the following information, but it is a new feature in 10.6 so there could be a bug. If so, please send in a report to PGI Customer Service (trs@pgroup.com).

% pgf90 -Mcuda=ptxinfo -O2 -V10.6 stream_cudafor.cuf -Mfixed
ptxas info : Compiling entry function 'stream_triad'
ptxas info : Used 8 registers, 36+16 bytes smem
ptxas info : Compiling entry function 'stream_add'
ptxas info : Used 6 registers, 28+16 bytes smem
ptxas info : Compiling entry function 'stream_scale'
ptxas info : Used 6 registers, 28+16 bytes smem
ptxas info : Compiling entry function 'stream_copy'
ptxas info : Used 4 registers, 20+16 bytes smem

-Mat

Hi Mat,

To describe the problem in more detail, here is part of my kernel code:


 
      if(i_ortho.eq.0)then

      do m = 1, m_blk(myproc)

      i = threadidx%x +  i_b(m) - 1
      j = blockidx%x +  j_b(m) - 1
      k = blockidx%y +  k_b(m) - 1

      if(i .LE. i_e(m) .and. j .LE. j_e(m)
     1                     .and. k .LE. k_e(m))then

c    Part 1
      vec_out(i,j,k,m) = 0.0
      vec_out(i,j,k,m) = ( ap_dev(19,i,j,k,m) * vec_in(i,j,k,m)
     1               - ( ap_dev(3,i,j,k,m)  * vec_in(i+1, j,k,m)
     1                 + ap_dev(4,i,j,k,m)  * vec_in(i-1,j,k,m)
     1                 + ap_dev(1,i,j,k,m)  * vec_in(i,j+1,k,m)
     1                 + ap_dev(2,i,j,k,m)  * vec_in(i,j-1,k,m)
     1                 + ap_dev(5,i,j,k,m)  * vec_in(i,j,k+1,m)
     1                 + ap_dev(6,i,j,k,m)  * vec_in(i,j,k-1,m)
     1                 + ap_dev(7,i,j,k,m) * vec_in(i+1,j+1,k,m)
     1                 + ap_dev(8,i,j,k,m) * vec_in(i-1,j+1,k,m))
     1                 ) * sps_dev(i,j,k,m)

c    Part 2
      vec_out(i,j,k,m) = vec_out(i,j,k,m)
     1                 + ( ap_dev(9,i,j,k,m) * vec_in(i+1,j-1,k,m)
     1                 + ap_dev(10,i,j,k,m) * vec_in(i-1,j-1,k,m)
     1                 + ap_dev(11,i,j,k,m) * vec_in(i,j+1,k+1,m)
     1                 + ap_dev(12,i,j,k,m) * vec_in(i,j+1,k-1,m)
     1                 + ap_dev(13,i,j,k,m) * vec_in(i+1,j,k+1,m)
     1                 + ap_dev(14,i,j,k,m) * vec_in(i+1,j,k-1,m)
     1                 + ap_dev(15,i,j,k,m) * vec_in(i,j-1,k+1,m)
     1                 + ap_dev(16,i,j,k,m) * vec_in(i,j-1,k-1,m)
     1                 + ap_dev(17,i,j,k,m) * vec_in(i-1,j,k+1,m)
     1                 + ap_dev(18,i,j,k,m) * vec_in(i-1,j,k-1,m)
     1                 ) * sps_dev(i,j,k,m)
      end if
      end do

      elseif(i_ortho.eq.1)then
     ... etc

The output from ptxinfo is:

ptxas info : Used 41 registers, 100+16 bytes smem, 496 bytes cmem[0], 4 bytes cmem[1]

The interesting part is that the problem I am solving never actually enters the first if-statement (in other words, i_ortho = 1). Even so, the program will not run because of the error: Too many resources requested for launch.

The only way I have found to get around this issue is to comment out Part 2 of the first if-statement. When I do, the ptxinfo output is:

ptxas info : Used 32 registers, 100+16 bytes smem, 496 bytes cmem[0], 4 bytes cmem[1]

And the code executes perfectly. However, in cases where i_ortho = 0 the calculation will be incomplete.

Do you have any suggestions as to what to do?

Thanks,

-Chris

Hi Chris,

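What’s your block size (i.e. the number of threads per block)? I suspect it’s 256. The maximum number of registers is 8192 per block. So when you’re using 32 registers per thread with 256 threads, you have exactly 8192 registers. At 41 registers per thread you need 10496, so you’re over the max. Registers are allocated for the kernel as a whole when it is compiled, so the Part 2 code counts against the limit even when i_ortho = 1 and that branch never runs.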
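To make the arithmetic concrete, here is a small sketch in plain Fortran (no GPU needed). The 256-thread block size is only my guess and the 8192-register limit is the figure quoted above, so both are assumptions rather than values read from your device; the program name reg_budget is also just for illustration.

```fortran
! Sketch: registers needed per block = registers per thread * threads per block.
! The block size and the register limit below are assumptions from this thread.
program reg_budget
  implicit none
  integer, parameter :: regs_per_block = 8192   ! limit quoted above (assumed)
  integer, parameter :: threads = 256           ! suspected block size (assumed)
  integer, parameter :: warp = 32               ! threads per warp

  call report(32)   ! ptxinfo figure with Part 2 commented out
  call report(41)   ! ptxinfo figure for the full kernel

contains

  subroutine report(regs_per_thread)
    integer, intent(in) :: regs_per_thread
    integer :: needed, max_threads

    needed = regs_per_thread * threads
    ! Largest multiple of the warp size that fits in the register budget
    max_threads = (regs_per_block / regs_per_thread / warp) * warp

    print *, 'regs/thread:', regs_per_thread, ' regs/block needed:', needed
    if (needed .le. regs_per_block) then
       print *, '  fits within', regs_per_block
    else
       print *, '  exceeds', regs_per_block, '; at most', max_threads, 'threads fit'
    end if
  end subroutine report

end program reg_budget
```

With these assumptions, 32 registers per thread needs 8192 registers and just fits, while 41 needs 10496 and would fit with at most 192 threads per block. The hardware may allocate registers in coarser chunks than this simple product suggests, so treat the 192 as an upper bound rather than a guarantee.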

The solutions are to either reduce the number of threads per block or reduce the number of registers. To reduce the number of registers, use the flag “-Mcuda=maxregcount:32”. Note that some performance will be lost either way, so try both. Most of the time, it’s better to reduce the number of registers rather than the number of threads.

Hope this helps,
Mat