Device Code Skipped at Runtime

Hi,

I am testing my CUDA Fortran code that calls a device code containing cuRAND_device subroutines. I would like to use 2D thread mapping, and each of the thread is given a random seed. I tried with different numbers of thread-per-block. Sometimes the device code seemed to be completely ignored and no random numbers were assigned to the array. It may happen when thread-per-block was larger and equal than 27. This did not seem to happen when I only used 1D thread/block structure.

The code was compiled with the following command:

 pgf90 -Mcuda=nollvm -Mcudalib=curand test.cuf -o test.exe

It can be run with the command:

> ./test.exe 50 10 32

I also tried what combination can cause the issue and I noticed that if I used -Mcuda=nollvm, this issue appears. Are there any constraints that keep NoLLVM from having more thread-per-block for a 2D structure? I did not seem to have the other way around since device cuRAND codes need NoLLVM to pass the compilation.

Some of the properties are as follows:

Device Name: Tesla M2090
Device Revision Number: 2.0
Global Memory Size: 6442123264
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535

The code is attached at the end of the message.

Thanks a lot for your help!

Jimmy

! RNG device module
module rng_kernel
  use curand_device
  contains
    attributes(global) subroutine rngenerator( a )
    real, device :: a(:), b(:)
    type(curandStateXORWOW) :: h
    integer(kind=8), parameter :: i8 = 8
    integer(kind=8) :: seed, seq, offset, ix, iy, iam, n
    
! 2D thread-block structure, 1 random seed per thread
    n = size(a)
    seq = 0_i8; offset = 0_i8
    ix = ThreadIdx%x + (BlockIdx%x - 1_i8) * BlockDim%x
    iy = ThreadIdx%y + (BlockIdx%y - 1_i8) * BlockDim%y
    iam = ix + (iy - 1_i8) * BlockDim%x * GridDim%x
    seed = iam*2_i8 + 5678_i8
    call curand_init(seed, seq, offset, h)
    if (iam <= n) then
      a(iam) = curand_uniform(h)
      write(*,*) iam, a(iam)
    end if

    return
    end subroutine rngenerator
end module rng_kernel

program test
  use rng_kernel
  use cudafor
  implicit none
  real, allocatable, device :: a(:)
  real, allocatable :: c(:)
  real :: rmean
  integer(kind=4) :: i
  integer(kind=8) :: n, nB=0, tPB=0
  logical :: passing
  character(len=100) :: fname
  character(len=100) :: arg
  type(dim3) :: griddims, blockdims

! Use-input parameters for total number of thread, griddim, and blockdim
  if (command_argument_count() < 1) then
    call show_help()
    stop
  end if
  do i = 1, command_argument_count()
    call get_command_argument(i, arg)
   !write(*,*) i, arg
    if (i .eq. 1) read(arg,'(i)') n
    if (i .eq. 2) read(arg,'(i)') nB
    if (i .eq. 3) read(arg,'(i)') tPB
  end do

  if (tPB == 0) tPB = 64
  if (nB == 0) nB = ceiling(sqrt(real(n))/real(tPB))
  griddims = dim3(nB,nB,1)
  blockdims = dim3(tPB,tPB,1)

  allocate(a(n))
  allocate(c(n))
  a = 0.0
  passing = .true.
  call rngenerator<<<griddims,blockdims>>> (a)  ! Calling device code !
  c = a
  write(*,*) 'Number of threads needed:         ', n
  write(*,*) 'Total number of threads generated:', nB*nB*tPB*tPB

  rmean = sum(c)/n
  if ((rmean .lt. 0.4) .or. (rmean .gt. 0.6)) then
    passing = .false.
    print *,"Mean is ",rmean," which fails"
  else
    print *,"Mean is ",rmean," which passes"

contains

  subroutine show_help()
    print *, 'usage: trand_uniform.exe nB tPB N'
    print *, ''
    print *, ' - parameters: '
    print *, '   N:     total number of the random number needed'
    print *, '   nB:    number of blocks (not utilized in method 1)'
    print *, '   tPB:   number of threads per block'
    print *, ''
  end subroutine show_help

end program test

Hi Jimmy,

I tried your code but it works fine for me when the threads per block argument is 32. However when you go above this, such as with the default value of 64, the total number of threads per block exceed the maximum 1024. (your block size is tPBxtPB so 32x32 = 1024, but 64x64 = 4096 which is too large). One suggestion is to add error checking to your code, otherwise the kernel will silently fail.

For example:

% cat test.CUF
! RNG device module
 module rng_kernel
   use curand_device
   contains
     attributes(global) subroutine rngenerator( a )
     real, device :: a(:), b(:)
     type(curandStateXORWOW) :: h
     integer(kind=8), parameter :: i8 = 8
     integer(kind=8) :: seed, seq, offset, ix, iy, iam, n

 ! 2D thread-block structure, 1 random seed per thread
     n = size(a)
     seq = 0_i8; offset = 0_i8
     ix = ThreadIdx%x + (BlockIdx%x - 1_i8) * BlockDim%x
     iy = ThreadIdx%y + (BlockIdx%y - 1_i8) * BlockDim%y
     iam = ix + (iy - 1_i8) * BlockDim%x * GridDim%x
     seed = iam*2_i8 + 5678_i8
     call curand_init(seed, seq, offset, h)
     if (iam <= n) then
       a(iam) = curand_uniform(h)
       write(*,*) iam, a(iam)
     end if

     return
     end subroutine rngenerator
 end module rng_kernel

 program test
   use rng_kernel
   use cudafor
   implicit none
   real, allocatable, device :: a(:)
   real, allocatable :: c(:)
   real :: rmean
   integer(kind=4) :: i
   integer         :: ierrSync, ierrAsync
   integer(kind=8) :: n, nB=0, tPB=0
   logical :: passing
   character(len=100) :: fname
   character(len=100) :: arg
   type(dim3) :: griddims, blockdims

 ! Use-input parameters for total number of thread, griddim, and blockdim
   if (command_argument_count() < 1) then
     call show_help()
     stop
   end if
   do i = 1, command_argument_count()
     call get_command_argument(i, arg)
    !write(*,*) i, arg
     if (i .eq. 1) read(arg,'(i)') n
     if (i .eq. 2) read(arg,'(i)') nB
     if (i .eq. 3) read(arg,'(i)') tPB
   end do

   if (tPB == 0) tPB = 64
   if (nB == 0) nB = ceiling(sqrt(real(n))/real(tPB))
   griddims = dim3(nB,nB,1)
   blockdims = dim3(tPB,tPB,1)

   allocate(a(n))
   allocate(c(n))
   a = 0.0
   passing = .true.
#ifdef DEBUG
   print *, "Launch Schedule:"
   print *, griddims
   print *, blockdims
#endif
   call rngenerator<<<griddims,blockdims>>> (a)  ! Calling device code !
#ifdef DEBUG
   ierrSync = cudaGetLastError()
   ierrAsync = cudaDeviceSynchronize()
   if (ierrSync /= cudaSuccess) write(*,*) &
     "Sync kernel error:", cudaGetErrorString(ierrSync)
   if (ierrAsync /= cudaSuccess) write(*,*) &
     "Async kernel error:", cudaGetErrorString(ierrAsync)
#endif
   c = a

   write(*,*) 'Number of threads needed:         ', n
   write(*,*) 'Total number of threads generated:', nB*nB*tPB*tPB

   rmean = sum(c)/n
   if ((rmean .lt. 0.4) .or. (rmean .gt. 0.6)) then
     passing = .false.
     print *,"Mean is ",rmean," which fails"
   else
     print *,"Mean is ",rmean," which passes"
   endif

 contains

   subroutine show_help()
     print *, 'usage: trand_uniform.exe nB tPB N'
     print *, ''
     print *, ' - parameters: '
     print *, '   N:     total number of the random number needed'
     print *, '   nB:    number of blocks (not utilized in method 1)'
     print *, '   tPB:   number of threads per block'
     print *, ''
   end subroutine show_help

 end program test

% pgf90 -Mcuda=nollvm -Mcudalib=curand -DDEBUG test.CUF -o test.exe
% ./test.exe 50 10 32 | grep Mean
 Mean is    0.5026075      which passes
% ./test.exe 50 10 64
 Launch Schedule:
           10           10            1
           64           64            1
 Sync kernel error:
 invalid configuration argument
 Number of threads needed:                                50
 Total number of threads generated:                   409600
 Mean is     0.000000      which fails
  • Mat

Hi Mat,

Thank you very much for your quick reply and answering my questions. I thought that it was due to some weird compilation options but it turned out that I just went past the maximum allowed number of threads. I do have some more questions that might have been directed to CUDA 101…:

  1. I am confused by the differences between “Maximum Threads per Block” and “Maximum Block Dimensions”. Now it seems that the three numbers “1024, 1024, 64” do not mean a total number of threads of 1024x1024x64. I am wondering what the three numbers really mean and whether/how I can utilize all three of them?

  2. I have tested and found out how (but not why) I was limited to 26 threads per dimension. If I turn off the write command within the device code, I can use up to 32x32 threads per block, but only 26x26 if I turn on the I/O. I also found the error message to be different if I go beyond 32x32 (when I turn on the I/O):
    32x32 → too many resources requested for launch
    33x33 → invalid configuration argument
    It seems to me that you did not encounter this even though you had turned on the I/O. Would it be related to the CUDA architecture (Tesla M2090, 2.0) or the PGF90 (16.9) that I am using?

Thanks,

Jimmy

  1. I am confused by the differences between “Maximum Threads per Block” and “Maximum Block Dimensions”. Now it seems that the three numbers “1024, 1024, 64” do not mean a total number of threads of 1024x1024x64. I am wondering what the three numbers really mean and whether/how I can utilize all three of them?

The product of the x, y, and z dimension sizes can not exceed 1024. But either the x or y dimension could be 1024. So 1024x1x1 is valid, or even 1x1024x1. 64x1x64 is not valid, but 4x4x64 is.

It seems to me that you did not encounter this even though you had turned on the I/O. Would it be related to the CUDA architecture (Tesla M2090, 2.0) or the PGF90 (16.9) that I am using?

This is because of register usage. Using the “ptxinfo” sub-option, I see that the registers per thread is 48. For 1024 threads, this means that the block needs 44032 registers. However, a M2090 only allows for 32678 registers per block. (you can get this info from the pgaccelinfo utility). As of CC30, the number of registers per block doubled to 65536.

% pgf90 -fast test.CUF -DDEBUG -Mcuda=nollvm,cc20,ptxinfo
ptxas info    : 77776 bytes gmem, 72 bytes cmem[2], 80 bytes cmem[14]
ptxas info    : Compiling entry function 'rng_kernel_rngenerator_' for 'sm_20'
ptxas info    : Function properties for rng_kernel_rngenerator_
    456 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 43 registers, 48 bytes cmem[0], 52 bytes cmem[16]

You can force the back-end device compiler to use fewer registers via the “maxregcount” sub-option.

pgf90 -fast test.CUF -DDEBUG -Mcuda=nollvm,cc20,ptxinfo,maxregcount:32

Hi Mat,

Thanks a lot for your explanation that resolves my confusion!

The total number of threads cannot exceed 1024 but the maximal threads allowed for each dimension can differ so it is read 1024, 1204, 64 instead of 1024x1024x64.

I also realized that more registers per thread are needed if in-device I/O is turned on, at least in my case. Without I/O, only 11 registers per thread are needed but 42 if I/O turned on. Having -Mcuda=maxregcount:32 did help.

Thanks again for your help!

Jimmy