help: global subroutine not executed when n is too large in <<<x, n>>>

Hello, I have a program written in CUDA Fortran (PGI Community Edition 17.10). It turns out that when n in <<<x,n>>> is too large, the instructions in a global subroutine are not executed. The structure of the code is:

attributes(global) subroutine overl_kernel2(IA,JA,xpacd,ypacd,nx,ny,nt,nsize,abandd,x24d,w24d)
implicit none
integer, value :: nx,ny,nt,nsize
integer :: i,j,ka
real*8 :: wr00,ovswp1
real*8, device, intent(IN) :: x24d(24),W24d(24)
real*8, device, intent(IN) :: xpacd(0:nx),ypacd(0:ny)
real*4, device :: abandd(nsize)
integer, intent(IN) :: IA(nsize),JA(nsize)

ka = (blockidx%x-1)*blockdim%x + threadidx%x
if(ka<=nsize) then
   ! ... body omitted ...
end if

end subroutine overl_kernel2

It is called from another subroutine with:

call overl_kernel2<<<(nsize-1)/nthr+1,nthr>>>(IAd,JAd,xpacd,ypacd,nx,ny,nt,nsize,abandd,x24d,w24d)
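As a side note, the (nsize-1)/nthr+1 grid size is the usual ceiling division. A small sketch (Python here, purely to illustrate the arithmetic in that launch line):

```python
def grid_size(nsize, nthr):
    # Fortran integer division: (nsize-1)/nthr + 1 == ceil(nsize/nthr)
    return (nsize - 1) // nthr + 1

# Enough blocks to cover all nsize elements; the if(ka<=nsize) guard
# in the kernel discards the excess threads in the last block.
print(grid_size(1000, 512))   # -> 2  (2*512 = 1024 threads cover 1000 elements)
print(grid_size(1024, 1024))  # -> 1
```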

It turned out that when "nthr" is too large, say "nthr=1024", the following instruction in subroutine "overl_kernel2"


is not executed at all, i.e., the function ovswp1 is never called. If, on the other hand, we change "nthr" to a smaller value, say "nthr=512", the program works well.

I am wondering what the cause of this problem could be?

I am using PGI Fortran Community Edition 17.10 with an NVIDIA Quadro P4000, on a laptop with 32 GB of memory running 64-bit Windows 10 Home edition.

thank you very much!

Add proper error checking. Either (1) the kernel is never launched because it exceeds a resource limit, or (2) it runs but is terminated abnormally because it exceeds the two-second kernel run time limit imposed by the operating system's GUI watchdog timer.

Thanks, njuffa! I added the following code after the <<<, >>> line:

ierrSync = cudaGetLastError()
ierrAsync = cudaDeviceSynchronize()
if (ierrSync /= cudaSuccess) write(*,*) "Sync kernel error:", cudaGetErrorString(ierrSync)
if (ierrAsync /= cudaSuccess) write(*,*) "Async kernel error:", cudaGetErrorString(ierrAsync)

and it gives me the following information:

Sync kernel error: too many resources requested for launch

This gives me some clue. I shall check for solutions. Thanks!

Thanks, but I have to ask for help again:

My graphics card has 1792 CUDA cores, and its other properties are the following:

Device Number: 0
GetDeviceProperties for device 0: Passed
Device Name: Quadro P4000
Compute Capability: 6.1
Number of Multiprocessors: 14
Max Threads per Multiprocessor: 2048
Global Memory (GB): 8.000

Execution Configuration Limits
Max Grid Dims: 2147483647 x 65535 x 65535
Max Block Dims: 1024 x 1024 x 64
Max Threads per Block: 1024

Why does an n in <<<x, n>>> as large as 1024 cause a "too many resources requested for launch" problem? Also, the same program works in Linux with n=1024 using a GTX 670! I don't understand why it does not work with a more powerful Quadro P4000. Is this because of the Windows system?


The most likely culprit is register usage. Register usage for the same code can differ when it is compiled for different GPUs, and can also vary from one CUDA version to the next. There are many treatments of the registers-per-thread limitations associated with CUDA codes, if you want to search for them.
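To make the arithmetic concrete: on compute capability 6.1, each SM has a 64K (65536) register file, which is also the per-block register limit. A rough sketch (Python; the 72 registers/thread figure is an assumption for illustration, not measured from this kernel, and allocation granularity is ignored):

```python
REGS_PER_BLOCK = 65536  # register file per SM on CC 6.1; also the per-block cap

def launch_fits(regs_per_thread, threads_per_block):
    """Rough check: does the block's total register demand fit
    in the register file? (Ignores warp allocation granularity.)"""
    return regs_per_thread * threads_per_block <= REGS_PER_BLOCK

# If the compiler happened to assign, say, 72 registers per thread:
print(launch_fits(72, 1024))  # False: 73728 > 65536 -> "too many resources"
print(launch_fits(72, 512))   # True:  36864 <= 65536 -> launch succeeds
```

If this turns out to be the cause, the usual remedies are to reduce the block size or to cap registers at compile time (pgfortran has a maxregcount suboption of -Mcuda for this; check the compiler documentation for the exact spelling in your release).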

And no CUDA code on the planet will work anywhere, on any GPU, if n is 1025 or larger. The limit for all CUDA codes on n (threads per block) is 1024. This is covered in the CUDA programming guide documentation.
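A launch configuration can be validated against those hard limits before the call. A minimal sketch (Python, using the limits from the device-properties output above; the function name is just for illustration):

```python
MAX_THREADS_PER_BLOCK = 1024   # architectural limit on all current GPUs
MAX_GRID_X = 2147483647        # grid x-dimension limit reported above

def validate_launch(grid_x, block_x):
    """Raise if a 1-D execution configuration <<<grid_x, block_x>>>
    exceeds the hard execution-configuration limits."""
    if not 1 <= block_x <= MAX_THREADS_PER_BLOCK:
        raise ValueError(f"block size {block_x} exceeds {MAX_THREADS_PER_BLOCK}")
    if not 1 <= grid_x <= MAX_GRID_X:
        raise ValueError(f"grid size {grid_x} out of range")
    return True

print(validate_launch(2, 1024))  # True: n = 1024 is legal in itself...
# validate_launch(1, 1025)       # ...but n = 1025 raises ValueError
```

Note that passing this check only means the configuration is legal in principle; a launch can still fail with "too many resources requested for launch" if the kernel's per-thread register usage is too high for the chosen block size.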