Is volatile working in CUDA Fortran?
I’m relying on implicit intra-warp synchronization for an algorithm and I’m getting wrong results.
[code]
do i = 1, LOG_WARP_SIZE, 1
   offset = 2**(i-1)
   sum = sum + reduction_shared(s - offset)
   reduction_shared(s) = sum
end do
[/code]
gives a different result (for threads within the same warp) than
[code]
do i = 1, LOG_WARP_SIZE, 1
   offset = 2**(i-1)
   sum = sum + reduction_shared(s - offset)
   reduction_shared(s) = sum
   call syncthreads()
end do
[/code]
May I ask how CUDA Fortran organizes the warps?
In CUDA C, warp 0 gets threads 0 to 31, warp 1 gets threads 32 to 63, and so on…
Could it be that CUDA Fortran makes warp 1 threads 1, 33, 65, etc., like column-major order or something like that?
I’m a bit puzzled by the apparent lack of implicit intra-warp synchronization.
Best regards,
Is volatile working in CUDA Fortran?
As of 12.4, yes. Though the “volatile” keyword is simply passed through to the generated low-level CUDA C, so it’s possible that the problem is with the back-end CUDA C compiler. You can see the generated CUDA code via the flag “-Mcuda=keepgpu”.
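For example, to keep the intermediate CUDA C for inspection (the source file name kernels.cuf is just illustrative):
[code]
pgfortran -Mcuda=keepgpu kernels.cuf
[/code]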
May I ask how does CUDA Fortran organize the warps?
The “threadidx” and “blockidx” variables are base 1 since that’s Fortran, but there is no change in the way warps are organized. The first warp gets the first 32 threads, no matter whether the base index is 1 or 0. (Under the hood, the indexing is adjusted when translated to CUDA C.)
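For example, the usual 0-based warp number and 1-based lane can be recovered from the 1-based thread index like this (an illustrative fragment; WARP_SIZE is assumed to be a parameter equal to 32):
[code]
tid  = threadIdx%x                   ! 1-based in CUDA Fortran
warp = (tid - 1) / WARP_SIZE         ! 0-based warp id: warp 0 owns threads 1..32
lane = mod(tid - 1, WARP_SIZE) + 1   ! 1-based lane within the warp
[/code]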
Hope this helps,
Mat
Well, we’ve been running some tests and I can confirm that implicit intra-warp synchronization is not working properly…
We are trying to figure out how it works. It is not necessarily a bad thing; it seems to provide better coalesced access without the need for a transposed read operation. Nevertheless, it should be documented.
The following kernels produce different results when they should not.
[code]
attributes(global) subroutine test1warp(input, result)
   integer, dimension(:) :: input, result
   integer, shared :: shared_values(SHAREDSTRIDE * NUMWARPS1)
   integer :: tid, warp, lane, sum, index

   tid = threadIdx%x
   warp = (tid-1) / WARP_SIZE
   lane = mod((tid-1), WARP_SIZE) + 1
   index = warp * SHAREDSTRIDE + lane + WARP_SIZE/2

   ! zero the padding below each warp's segment
   shared_values(index - 16) = 0
   call syncthreads() ! yep, this may be important

   sum = input(tid)
   shared_values(index) = sum
   ! syncthreads to make sure everything is ok
   call syncthreads()

   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   ! manually unrolled loop to make sure everything is ok
   sum = sum + shared_values(index - 1)
   shared_values(index) = sum
   sum = sum + shared_values(index - 2)
   shared_values(index) = sum
   sum = sum + shared_values(index - 4)
   shared_values(index) = sum
   sum = sum + shared_values(index - 8)
   shared_values(index) = sum
   sum = sum + shared_values(index - 16)
   shared_values(index) = sum
   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   ! and then we write out
   call syncthreads()
   result(tid) = shared_values(index)
end subroutine test1warp
[/code]
And… the explicitly synchronized version:
[code]
attributes(global) subroutine test1warpSync(input, result)
   integer, dimension(:) :: input, result
   integer, shared :: shared_values(SHAREDSTRIDE * NUMWARPS1)
   integer :: tid, warp, lane, sum, index

   tid = threadIdx%x
   warp = (tid-1) / WARP_SIZE
   lane = mod((tid-1), WARP_SIZE) + 1
   index = warp * SHAREDSTRIDE + lane + WARP_SIZE/2

   ! zero the padding below each warp's segment
   shared_values(index - 16) = 0
   call syncthreads() ! yep, this may be important

   sum = input(tid)
   shared_values(index) = sum
   ! syncthreads to make sure everything is ok
   call syncthreads()

   !!!
   ! manually unrolled loop, now with an explicit barrier after every step
   sum = sum + shared_values(index - 1)
   shared_values(index) = sum
   call syncthreads()
   sum = sum + shared_values(index - 2)
   shared_values(index) = sum
   call syncthreads()
   sum = sum + shared_values(index - 4)
   shared_values(index) = sum
   call syncthreads()
   sum = sum + shared_values(index - 8)
   shared_values(index) = sum
   call syncthreads()
   sum = sum + shared_values(index - 16)
   shared_values(index) = sum
   !!!
   ! and then we write out
   call syncthreads()
   result(tid) = shared_values(index)
end subroutine test1warpSync
[/code]
You can launch the kernels with any integer array as input; if you try with 32 threads (1 warp) and 1 block, you can easily see the difference.
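For reference, a minimal host-side driver for such a test could look like the sketch below. The module name testmod is an assumption (it would hold the two kernels and the SHAREDSTRIDE/NUMWARPS1/WARP_SIZE parameters); everything else follows the text above.
[code]
program compare_warp_sync
   use cudafor
   use testmod                               ! hypothetical module holding the two kernels
   implicit none
   integer, parameter :: n = 32              ! 32 threads = 1 warp
   integer :: h_in(n), h_out(n), h_out_sync(n), i
   integer, device :: d_in(n), d_res(n)

   h_in = (/ (i, i = 1, n) /)                ! any integer input will do
   d_in = h_in

   call test1warp<<<1, n>>>(d_in, d_res)     ! 1 block, 1 warp
   h_out = d_res

   call test1warpSync<<<1, n>>>(d_in, d_res)
   h_out_sync = d_res

   if (any(h_out /= h_out_sync)) then
      print *, 'kernels disagree'
   else
      print *, 'kernels agree'
   end if
end program compare_warp_sync
[/code]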
Vicente will try to send the code of the tests; he’ll be doing some more testing during the weekend.
Best regards,
We are now almost completely certain that the problem is with the “volatile” keyword… for some reason it is not being taken very seriously.
After checking the generated GPU code, it seems that pointers are used to access the shared memory; probably those pointers should be marked as “volatile” as well.
Just an idea…
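If that is what is happening, one untested idea from the Fortran side is to put the VOLATILE attribute directly on the shared array in the test kernels and check (via keepgpu) whether it propagates to the generated pointers:
[code]
! hypothetical change to the declaration in the test kernels above
integer, shared, volatile :: shared_values(SHAREDSTRIDE * NUMWARPS1)
[/code]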