Hi,
I was trying to print out debugging information from within the thread, but no information was printed out if I used dim3 type to specify threads per block. Below is a testing code:
attributes(global) subroutine test_output(n, m, l)
implicit none
integer(kind=8) :: n, m, l, i, j, k
i = ThreadIdx%x + (BlockIdx%x - 1) * BlockDim%x
j = ThreadIdx%y + (BlockIdx%y - 1) * BlockDim%y
k = ThreadIdx%z + (BlockIdx%z - 1) * BlockDim%z
if (i <= n) then
if (j <= m) then
if (k <= l) then
write(*,*) i, j, k
end if
end if
end if
end subroutine test_output
program test
use cudafor
implicit none
integer(kind=8) :: n, m, l
integer(kind=8), device :: n_dev, m_dev, l_dev
n = 5; m = 5; l = 1
n_dev = n; m_dev = m; l_dev = l
call test_output<<<dim3(1,1,1),dim3(128,128,128)>>>(n_dev, m_dev, l_dev) ! Test 1
call test_output<<<dim3(n/128,m/128,l/128),128>>>(n_dev, m_dev, l_dev) ! Test 2
call test_output<<<dim3(n,m,l),1>>>(n_dev, m_dev, l_dev) ! Test 3
call test_output<<<1,dim3(128,128,128)>>>(n_dev, m_dev, l_dev) ! Test 4
call test_output<<<1,128>>>(n_dev, m_dev, l_dev) ! Test 5
end program
For the above 5 tests, only Tests 2, 3, 5 produced outputs.
Test 2 gave
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
Test 3 gave
5 5 1
3 3 1
3 4 1
4 5 1
2 4 1
1 2 1
5 1 1
1 5 1
3 1 1
3 5 1
4 2 1
4 4 1
5 3 1
2 3 1
1 1 1
2 2 1
1 4 1
2 5 1
4 1 1
5 2 1
4 3 1
1 3 1
3 2 1
5 4 1
2 1 1
Test 5 gave
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
Test 3 would be the preferred result but the usage of threads does not seem to be efficient. Is there a more sensible way to specify the number of threads per block?
(I compiled with “pgf90 -Mcuda=cc60,ptxinfo,nollvm test.cuf -o test.exe”.)
Thanks,
Jimmy