Hello,
I have a Fortran kernel like the example below, which works without the while loop. After adding the while loop, the compiler seems to remove the entire kernel code (an optimization?), even when compiling with -O0. Why is that happening?
I want to execute the kernel with all threads defined in just one block.
Thanks
Working code (ceiling(real(N)/1024) blocks, 1024 threads per block):
idx = getThreadID
if (idx <= N) then
  do_computations()
endif
Not working code (one block, 1024 threads):
count = 0
do while (count < N)
  idx = getThreadID + count
  if (idx <= N) then
    do_computations()
  endif
  count = count + 1024
enddo
Hi Henrique,
Sorry but there’s not enough information here to help. Can you please provide a reproducing example?
I’ve included while loops in CUDA Fortran kernels before and they’ve worked fine. Also, given that it still fails at -O0, something else seems to be wrong.
-Mat
A code like this works, calling as “call calcsub<<<nblocks,1024>>>()”, where “nblocks = ceiling(real(N)/1024)”.
module calc
contains
  attributes(global) subroutine calcsub
    use vars ! where N and A are defined, A is of size N
    integer idx
    idx = blockDim%x*(blockIdx%x-1) + threadIdx%x
    if (idx <= N) then
      A(idx) = A(idx) + 1.0
    endif
  end subroutine
end module
This code does not work, calling as “call calcsub<<<1,1024>>>()”.
module calc
contains
  attributes(global) subroutine calcsub
    use vars ! where N and A are defined, A is of size N
    integer idx, count
    count = 0
    do while (count < N)
      idx = blockDim%x*(blockIdx%x-1) + threadIdx%x + count
      if (idx <= N) then
        A(idx) = A(idx) + 1.0
      endif
      count = count + 1024
    enddo
  end subroutine
end module
Since I want to use just one block, the while loop is needed so that each thread can access more than one memory location and the entire array A gets updated.
Hi Henrique,
I pulled together the following test code from your samples, but for good or bad, I get the same results from both kernels. Can you provide any additional information to help determine the issue?
Thanks,
Mat
% cat test.cuf
module vars
  integer, device :: N
  integer :: Nh
  real, allocatable, dimension(:), device :: A, B
  real, allocatable, dimension(:) :: A_h, B_h
end module vars

module calcA
contains
  attributes(global) subroutine calcAsub
    use vars ! where N and A are defined, A is of size N
    integer idx
    idx = blockDim%x*(blockIdx%x-1) + threadIdx%x
    if (idx <= N) then
      A(idx) = A(idx) + 1.0
    endif
  end subroutine calcAsub
end module calcA

module calcB
contains
  attributes(global) subroutine calcBsub
    use vars ! where N and B are defined, B is of size N
    integer idx, count
    count = 0
    do while (count < N)
      idx = blockDim%x*(blockIdx%x-1) + threadIdx%x + count
      if (idx <= N) then
        B(idx) = B(idx) + 1.0
      endif
      count = count + 1024
    enddo
  end subroutine calcBsub
end module calcB

program test
  use cudafor
  use vars
  use calcA
  use calcB
  implicit none
  integer :: i, err
  Nh = 10000
  N = Nh
  allocate(A_h(Nh))
  allocate(A(Nh))
  allocate(B_h(Nh))
  allocate(B(Nh))
  A_h = 2.0
  B_h = 2.0
  A = A_h
  B = B_h
  print *, "Before A=", A_h(12:15)
  print *, "Before B=", B_h(12:15)
  call calcAsub<<<ceiling(real(Nh)/1024),1024>>>()
  call calcBsub<<<1,1024>>>()
  A_h = A
  B_h = B
  print *, "After A=", A_h(12:15)
  print *, "After B=", B_h(12:15)
  err = 0
  do i = 1, Nh
    if (A_h(i) .ne. B_h(i)) then
      err = err + 1
    endif
  enddo
  print *, "Num errors: ", err
  deallocate(A_h)
  deallocate(B_h)
  deallocate(A)
  deallocate(B)
end program test
% pgfortran test.cuf -fast -V19.10 ; a.out
Before A= 2.000000 2.000000 2.000000 2.000000
Before B= 2.000000 2.000000 2.000000 2.000000
After A= 3.000000 3.000000 3.000000 3.000000
After B= 3.000000 3.000000 3.000000 3.000000
Num errors: 0
I think I found the problem, but I’m still looking for a way to solve it.
I added the compilation flags “-Mcuda=ptxinfo -v”, which showed that the problem has to do with the number of registers used (similar to cudaErrorLaunchOutOfResources, aka “too many resources requested for launch”, discussed elsewhere on the CUDA Programming and Performance forum).
From deviceQuery: “Total number of registers available per block: 65536”
Compiling the example code reports 9 and 12 registers for calcAsub and calcBsub, respectively, so 9 × 1024 = 9216 and 12 × 1024 = 12288 registers per block, both within the limit.
I implemented 3 CUDA kernels in the scientific model I’m working with. From the compilation with the above flags I get: 32, 80, and 110 registers for kernels 1, 2, and 3, respectively. The problem occurs with kernel 2 because 80*1024 = 81920 > 65536.
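The arithmetic above can be checked with a short script (a sketch in Python rather than Fortran, with the 65536-register-per-block figure from the deviceQuery output hard-coded as an assumption):

```python
# Check whether a kernel launch fits the per-block register budget.
# REGS_PER_BLOCK (65536) is the deviceQuery figure quoted above.
REGS_PER_BLOCK = 65536

def launch_fits(regs_per_thread, threads_per_block):
    """True if the block's total register demand fits the budget."""
    return regs_per_thread * threads_per_block <= REGS_PER_BLOCK

# The small test kernels fit comfortably at 1024 threads per block:
print(launch_fits(9, 1024))   # calcAsub:  9 * 1024 =  9216 -> True
print(launch_fits(12, 1024))  # calcBsub: 12 * 1024 = 12288 -> True

# Kernel 2 of the real model does not:
print(launch_fits(80, 1024))  # 80 * 1024 = 81920 > 65536 -> False
```

At 1024 threads per block, any kernel needing more than 65536 / 1024 = 64 registers per thread will fail to launch this way.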
How can I solve this issue?
Ok, so the real issue is that the kernel itself is failing to launch due to a resource limitation (i.e., the number of registers). To fix it, you need to reduce the register usage by either lowering the number of threads per block (max 768 in your case), setting the flag “-Mcuda=maxregcount:64”, modifying the code to reduce the number of intermediate variables (i.e., local scalars or compiler-generated temporaries that hold array address computations), or splitting the computation into multiple kernels.
The latter two options are the most difficult but viable if needed. Setting the max register count is the easiest option but can be detrimental to performance, since storage is still required for the remaining 16 registers, which then get spilled. If they only spill to the L2 cache, it’s not too bad, but if they spill to global memory it can negatively impact performance. You’ll need to experiment to see the actual impact on your code.
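The maxregcount value of 64 follows directly from the budget arithmetic (a sketch in Python, assuming the 65536-register-per-block figure from deviceQuery and the 80-register kernel discussed above):

```python
# Why maxregcount:64 - the largest per-thread register count that lets
# a 1024-thread block fit into 65536 registers per block.
REGS_PER_BLOCK = 65536
THREADS = 1024

max_regs_per_thread = REGS_PER_BLOCK // THREADS
print(max_regs_per_thread)  # 64, hence -Mcuda=maxregcount:64

# With kernel 2 wanting 80 registers per thread, capping at 64 means
# 80 - 64 = 16 registers' worth of data must spill to L2 or global memory.
spilled = 80 - max_regs_per_thread
print(spilled)              # 16
```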
The best option would be to use more blocks and reduce the number of threads per block. Using a single block will severely underutilize the device, since only one multiprocessor will be used. The only time I’d use a single block is if I were launching multiple kernels on different streams to take advantage of the other multiprocessors.
-Mat
Thanks for your reply. Actually, I want to analyze the performance when using all threads in only one block. Compiling with maxregcount solved the problem, and performance degraded as expected. Still, it’s strange how adding a simple while loop can change the behavior of code on the GPU.