While loop not executed inside kernel

Hello,

I have a CUDA Fortran kernel like the example below, which works fine without a while loop. After adding the while loop, the compiler seems to optimize the entire kernel body away, even when compiling with -O0. Why is that happening?

I want to execute the kernel with all threads defined in just one block.

Thanks

Working code (ceiling(real(N)/1024) blocks, 1024 threads per block):

idx = getThreadID
if(idx <= N)then
    call do_computations()
endif

Not working code (one block, 1024 threads):

count = 0
do while (count < N)
    idx = getThreadID + count
    if(idx <= N)then
        call do_computations()
    endif
    count = count+1024
enddo

Hi Henrique,

Sorry but there’s not enough information here to help. Can you please provide a reproducing example?

I’ve included while loops in CUDA Fortran kernels before and they’ve worked fine. Also, the fact that it still fails at -O0 seems to indicate that something else is wrong.

-Mat

Code like this works, called as “call calcsub<<<nblocks,1024>>>()”, where “nblocks = ceiling(real(N)/1024)”:

module calc

contains

attributes(global) subroutine calcsub

use vars ! where N and A are defined, A is of size N

integer idx

idx = blockDim%x*(blockIdx%x-1)+threadIdx%x
if(idx <= N)then
    A(idx) = A(idx)+1.0
endif

end subroutine

end module

This code does not work, called as “call calcsub<<<1,1024>>>()”:

module calc

contains

attributes(global) subroutine calcsub

use vars ! where N and A are defined, A is of size N

integer idx, count

count = 0
do while (count < N)
    idx = blockDim%x*(blockIdx%x-1)+threadIdx%x + count
    if(idx <= N)then
        A(idx) = A(idx)+1.0
    endif
    count = count+1024
enddo

end subroutine

end module

Since I want to use just one block, the while loop is needed so that each thread updates more than one element of A; otherwise a single block of 1024 threads could never cover the entire array.
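
In other words, each thread is meant to update A(threadIdx%x), A(threadIdx%x+1024), A(threadIdx%x+2048), and so on. The same striding could also be written as a counted loop; this is just a sketch equivalent to the while-loop kernel above, assuming the same vars module and a single block of 1024 threads:

attributes(global) subroutine calcsub

use vars ! where N and A are defined, A is of size N

integer idx

! each thread strides through A by the block size (1024 here)
do idx = threadIdx%x, N, blockDim%x
    A(idx) = A(idx)+1.0
enddo

end subroutine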

Hi Henrique,

I pulled together the following test code from your samples, but for better or worse, I get the same results from both kernels. Is there any additional information you can provide to help determine the issue?

Thanks,
Mat

% cat test.cuf
module vars
   integer, device :: N
   integer :: Nh
   real, allocatable, dimension(:), device :: A, B
   real, allocatable, dimension(:) :: A_h, B_h
end module vars

module calcA

contains

attributes(global) subroutine calcAsub

use vars ! where N and A are defined, A is of size N

integer idx

idx = blockDim%x*(blockIdx%x-1)+threadIdx%x
if(idx <= N)then
    A(idx) = A(idx)+1.0
endif

end subroutine calcAsub

end module calcA

module calcB

contains

attributes(global) subroutine calcBsub

use vars ! where N and B are defined, B is of size N

integer idx, count

count = 0
do while (count < N)
    idx = blockDim%x*(blockIdx%x-1)+threadIdx%x + count
    if(idx <= N)then
        B(idx) = B(idx)+1.0
    endif
    count = count+1024
enddo

end subroutine calcBsub

end module calcB

program test

   use cudafor
   use vars
   use calcA
   use calcB
   implicit none
   integer :: i, err

   Nh = 10000
   N = Nh
   allocate(A_h(Nh))
   allocate(A(Nh))
   allocate(B_h(Nh))
   allocate(B(Nh))
   A_h=2.0
   B_h=2.0
   A=A_h
   B=B_h
   print *, "Before A=", A_h(12:15)
   print *, "Before B=", B_h(12:15)
   call calcAsub<<<ceiling(real(Nh)/1024),1024>>>()
   call calcBsub<<<1,1024>>>()
   A_h=A
   B_h=B
   print *, "After A=", A_h(12:15)
   print *, "After B=", B_h(12:15)
   err = 0
   do i=1,Nh
     if (A_h(i) .ne. B_h(i) ) then
        err = err + 1
     endif
   enddo
   print *, "Num errors: ", err
   deallocate(A_h)
   deallocate(B_h)
   deallocate(A)
   deallocate(B)
end program test

% pgfortran test.cuf -fast -V19.10 ; a.out
 Before A=    2.000000        2.000000        2.000000        2.000000
 Before B=    2.000000        2.000000        2.000000        2.000000
 After A=    3.000000        3.000000        3.000000        3.000000
 After B=    3.000000        3.000000        3.000000        3.000000
 Num errors:             0

I think I found the problem, but I’m still looking for a way to solve it.

I added the compilation flags “-Mcuda=ptxinfo -v”, and from their output I saw that the problem has to do with the number of registers used (similar to https://devtalk.nvidia.com/default/topic/569648/cudaerrorlaunchoutofresources-aka-quot-too-many-resources-requested-for-launch-quot-/).
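
Applied to the test code above, that would be something like:

% pgfortran test.cuf -fast -Mcuda=ptxinfo -v

which prints the ptxas resource usage (registers, shared memory, etc.) for each kernel.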

From deviceQuery: “Total number of registers available per block: 65536”

The compilation of the example code gives 9 and 12 registers for calcAsub and calcBsub, respectively, so 9*1024 = 9216 and 12*1024 = 12288.

I implemented 3 CUDA kernels in the scientific model I’m working with. From the compilation with the above flags I get: 32, 80, and 110 registers for kernels 1, 2, and 3, respectively. The problem occurs with kernel 2 because 80*1024 = 81920 > 65536.

How can I solve this issue?

Ok, so the real issue is that the kernel itself is failing to launch due to a resource limitation (i.e. the number of registers). To fix it, you need to reduce the per-block register usage by either lowering the number of threads per block (max 768 in your case), setting the flag “-Mcuda=maxregcount:64”, modifying the code to reduce the number of intermediate variables (i.e. local scalars or compiler-generated temporaries that hold array address computations), or splitting the computation into multiple kernels.

The latter two options are the most difficult but are viable if needed. Setting the max register count is the easiest option but can be detrimental to performance, since storage is still needed for the 16 registers’ worth of values that no longer fit, and those get spilled to local memory. If the spills stay in the L2 cache it’s not too bad, but if they go out to global memory it can hurt performance significantly. You’ll need to experiment to see the actual impact on your code.
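
For example, with the reproducer above that would be something like the following (64 is just a starting point; the cap value is something to tune):

% pgfortran test.cuf -fast -Mcuda=maxregcount:64 ; a.out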

The best option would be to use more blocks and reduce the number of threads per block. Using a single block will severely underutilize the device, since only one multiprocessor will be used. The only time I’d use a single block is if I were launching multiple kernels on different streams to take advantage of the other multiprocessors.
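
If you do go that way, the only change needed in the while-loop kernel is to stride by the total number of threads in the grid instead of the hard-coded 1024, so it still covers all of B for any launch configuration. A rough, untested sketch with made-up launch numbers:

attributes(global) subroutine calcBsub

use vars ! where N and B are defined, B is of size N

integer idx, count, stride

stride = blockDim%x * gridDim%x ! total number of threads in the grid
count = 0
do while (count < N)
    idx = blockDim%x*(blockIdx%x-1)+threadIdx%x + count
    if(idx <= N)then
        B(idx) = B(idx)+1.0
    endif
    count = count+stride
enddo

end subroutine calcBsub

Called as, say, “call calcBsub<<<16,256>>>()”, a block then needs only 256*80 = 20480 registers, well under the 65536 limit.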

-Mat

Thanks for your reply. Actually, I want to analyze the performance when using all threads in only one block. Compiling with maxregcount solved the problem, and performance degraded as expected. Still, it's surprising how adding a simple while loop can change the behavior of code on the GPU.