Thread STDOUT Suppressed with Dim3 Block-Thread Structure?

Hi,

I was trying to print out debugging information from within the thread, but no information was printed out if I used dim3 type to specify threads per block. Below is a testing code:

attributes(global) subroutine test_output(n, m, l)
  implicit none
  integer(kind=8) :: n, m, l, i, j, k
  i = ThreadIdx%x + (BlockIdx%x - 1) * BlockDim%x
  j = ThreadIdx%y + (BlockIdx%y - 1) * BlockDim%y
  k = ThreadIdx%z + (BlockIdx%z - 1) * BlockDim%z
  if (i <= n) then
    if (j <= m) then
      if (k <= l) then
        write(*,*) i, j, k
      end if
    end if
  end if
end subroutine test_output

program test
  use cudafor
  implicit none
  integer(kind=8) :: n, m, l
  integer(kind=8), device :: n_dev, m_dev, l_dev

  n = 5; m = 5; l = 1
  n_dev = n; m_dev = m; l_dev = l

  call test_output<<<dim3(1,1,1),dim3(128,128,128)>>>(n_dev, m_dev, l_dev) ! Test 1
  call test_output<<<dim3(n/128,m/128,l/128),128>>>(n_dev, m_dev, l_dev)   ! Test 2
  call test_output<<<dim3(n,m,l),1>>>(n_dev, m_dev, l_dev)                 ! Test 3
  call test_output<<<1,dim3(128,128,128)>>>(n_dev, m_dev, l_dev)           ! Test 4
  call test_output<<<1,128>>>(n_dev, m_dev, l_dev)                         ! Test 5 

end program

For the above 5 tests, only Tests 2, 3, 5 produced outputs.
Test 2 gave
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
Test 3 gave
5 5 1
3 3 1
3 4 1
4 5 1
2 4 1
1 2 1
5 1 1
1 5 1
3 1 1
3 5 1
4 2 1
4 4 1
5 3 1
2 3 1
1 1 1
2 2 1
1 4 1
2 5 1
4 1 1
5 2 1
4 3 1
1 3 1
3 2 1
5 4 1
2 1 1
Test 5 gave
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1

Test 3 would be the preferred result but the usage of threads does not seem to be efficient. Is there a more sensible way to specify the number of threads per block?

(I compiled with “pgf90 -Mcuda=cc60,ptxinfo,nollvm test.cuf -o test.exe”.)

Thanks,

Jimmy

Hi,

I just realized that I got the problem because having a three-dimensional thread of dim3(128,128,128) has way exceeded the maximum capacity of 1024 threads per block. (I have asked the same question here before but was just reminded by a GPU-guru friend.) The block structure can be three-dimensional in order to make the more use of the threads but within the blocks, a one-dimensional thread suffices to do the job.

A code that works may look like the following:

attributes(global) subroutine test_output(n, m, l)
! 
! Three-dimensional blocks, one-dimensional thread.
!
  implicit none
  integer(kind=8) :: n, m, l, i, j, k
  i = ThreadIdx%x + (BlockIdx%x - 1) * BlockDim%x
  j = ThreadIdx%x + (BlockIdx%y - 1) * BlockDim%y
  k = ThreadIdx%x + (BlockIdx%z - 1) * BlockDim%z
  if (i <= n) then
    if (j <= m) then
      if (k <= l) then
        write(*,*) i, j, k
      end if
    end if
  end if
end subroutine test_output

program test
  use cudafor
  implicit none
  integer(kind=8) :: n, m, l
  integer(kind=8), device :: n_dev, m_dev, l_dev

  n = 5; m = 5; l = 1
  n_dev = n; m_dev = m; l_dev = l

  call test_output<<<dim3(n/128+1,m/128+1,l/128+1),128>>>(n_dev, m_dev, l_dev)  

end program

Or it can be even simplified as to use
i = ThreadIdx%x + (BlockIdx%x - 1) * BlockDim%x
j = BlockIdx%y
k = BlockIdx%z
call test_output<<<dim3(n/128+1,m,l),128>>>(n_dev, m_dev, l_dev)

Thanks,

Jimmy

Glad you solved your problem. Also note that by default, I believe only 256 print statements make it back to the host per kernel launch. You can change that value using a call to cudaDeviceSetLimit