If I understand correctly, all variables declared in a device routine are stored in device global memory.
The following is part of the code taken from the PGI CUDA Fortran Programming Guide:
attributes(global) subroutine mmul_kernel( A, B, C, N, M, L )
    real :: A(N,M), B(M,L), C(N,L)
    integer, value :: N, M, L
    integer :: i, j, kb, k, tx, ty
    ! submatrices stored in shared memory
    real, shared :: Asub(16,16), Bsub(16,16)
    ! the value of C(i,j) being computed
    real :: Cij
    ! Get the thread indices
    tx = threadidx%x
    ty = threadidx%y
    ! This thread computes C(i,j) = sum(A(i,:) * B(:,j))
    i = (blockidx%x-1) * 16 + tx
    j = (blockidx%y-1) * 16 + ty
    Cij = 0.0
    [ other code]
My question is: if i, j, Cij, and so on are in device global memory, and are therefore not private to each thread, why aren’t there data races or other problems when threads access the same location in memory?
I’m writing some programs in CUDA Fortran and I don’t understand when a variable is private to a single thread (or whether that is possible at all).
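To make the question concrete, here is a minimal test I could imagine running. It is my own sketch, not code from the guide, and the names private_test, write_ids, out_d and out_h are made up for illustration. Each thread computes its index into a variable i declared the same way as in the kernel above (no shared or device attribute) and writes it to its own element of a device array. If i really were a single location shared by all threads, I would expect some elements to come back with the wrong value.

```fortran
module private_test
  use cudafor
  implicit none
contains
  ! Hypothetical test kernel: every thread stores its global index in i,
  ! then copies i into its own element of out.
  attributes(global) subroutine write_ids(out, n)
    integer, value :: n
    integer :: out(n)
    integer :: i          ! declared like i, j, Cij in the guide's kernel
    i = (blockidx%x - 1) * blockdim%x + threadidx%x
    if (i <= n) out(i) = i
  end subroutine write_ids
end module private_test

program main
  use cudafor
  use private_test
  implicit none
  integer, parameter :: n = 64
  integer, device :: out_d(n)   ! array in device global memory
  integer :: out_h(n), k, nbad

  out_d = 0
  ! 4 blocks of 16 threads, one thread per array element
  call write_ids<<<n/16, 16>>>(out_d, n)
  out_h = out_d                 ! copy the result back to the host

  ! Count elements that do not hold their own index
  nbad = 0
  do k = 1, n
    if (out_h(k) /= k) nbad = nbad + 1
  end do
  print *, 'elements with unexpected value:', nbad
end program main
```

I assume this would be compiled with the PGI compiler as a .cuf file. A count of zero would suggest that each thread has its own copy of i, but I would still like to understand why.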
Thank you