volatile in CUDA Fortran

Is volatile working in CUDA Fortran?
I’m relying on implicit intra-warp synchronization for an algorithm and I’m getting wrong results.

[code]
do i = 1, LOG_WARP_SIZE, 1
    offset = 2**(i-1)
    sum = sum + reduction_shared(s - offset)
    reduction_shared(s) = sum
end do
[/code]

gets a different result (for threads within the same warp) than

[code]
do i = 1, LOG_WARP_SIZE, 1
    offset = 2**(i-1)
    sum = sum + reduction_shared(s - offset)
    reduction_shared(s) = sum
    call syncthreads()
end do
[/code]

May I ask how CUDA Fortran organizes the warps?
In CUDA C, warp 0 gets threads 0 to 31, warp 1 gets threads 32 to 63, and so on…
Could it be that CUDA Fortran makes warp 1 hold threads 1, 33, 65, etc., like column-major order or something like that?

I’m a bit puzzled by the lack of implicit intra-warp synchronization.

Best regards,

Is volatile working in CUDA Fortran?

As of 12.4, yes. However, the “volatile” keyword is simply passed through to the generated low-level CUDA C, so it’s possible that the problem is with the back-end CUDA C compiler. You can see the generated CUDA code via the flag “-Mcuda=keepgpu”.
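As a rough illustration (the kernel name and array size here are just placeholders, not code from this thread), the VOLATILE attribute would be attached to a shared array something like this; whether it is honored all the way through to the generated code is exactly what is in question:

[code]
! Hedged sketch: a shared buffer declared VOLATILE so that every
! reference is, in principle, re-read from shared memory instead of
! being cached in a register.
attributes(global) subroutine volatile_demo(input, result)
    integer, dimension(:)     :: input, result
    integer, shared, volatile :: buf(32)
    integer                   :: tid

    tid = threadIdx%x
    buf(tid) = input(tid)
    ! ... warp-level work that relies on buf being re-read from
    !     shared memory would go here ...
    result(tid) = buf(tid)
end subroutine volatile_demo
[/code]

Compiling with “-Mcuda=keepgpu” lets you check whether the qualifier survives into the generated CUDA C.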

May I ask how does CUDA Fortran organize the warps?

The “threadidx” and “blockidx” are base 1 since that’s Fortran, but there is no change in the way warps are organized. The first warp gets the first 32 threads, no matter whether the base index is 1 or 0. (Under the hood, the indexing gets adjusted when translated to CUDA C.)
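As a quick illustration, the usual warp/lane arithmetic is the same as in CUDA C, just shifted by one; the subroutine name below is only a placeholder:

[code]
! Sketch: warp/lane arithmetic with the 1-based threadIdx%x.
! Threads 1..32 land in warp 0, threads 33..64 in warp 1, and so on.
attributes(device) subroutine warp_and_lane(warp, lane)
    integer, intent(out) :: warp, lane
    integer              :: tid
    tid  = threadIdx%x
    warp = (tid - 1) / 32          ! 0-based warp index within the block
    lane = mod(tid - 1, 32) + 1    ! 1-based lane within the warp
end subroutine warp_and_lane
[/code]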

Hope this helps,
Mat

Well, we’ve been running some tests and I can confirm that implicit intra-warp synchronization is not working properly…

We are trying to figure out how it is working. It is not necessarily a bad thing, since it seems to provide better coalesced access without the need for a transpose read operation; nevertheless, it should be documented.

The following kernels produce different results when they should not.


[code]
attributes(global) subroutine test1warp(input, result)
    integer, dimension(:) :: input, result
    integer, shared       :: shared_values(SHAREDSTRIDE * NUMWARPS1)
    integer               :: tid, warp, lane, index, sum

    tid = threadIdx%x
    warp = (tid-1)/WARP_SIZE
    lane = mod((tid-1), WARP_SIZE) + 1
    index = warp * SHAREDSTRIDE + lane + WARP_SIZE/2
    shared_values(index - 16) = 0
    call syncthreads() !yep this may be important
    sum = input(tid)
    shared_values(index) = sum
    !syncthreads to make sure everything is ok
    call syncthreads()
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    !manually unrolled loop to make sure everything is ok
    sum = sum + shared_values(index - 1)
    shared_values(index) = sum
    sum = sum + shared_values(index - 2)
    shared_values(index) = sum
    sum = sum + shared_values(index - 4)
    shared_values(index) = sum
    sum = sum + shared_values(index - 8)
    shared_values(index) = sum
    sum = sum + shared_values(index - 16)
    shared_values(index) = sum
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    !and then we write out
    call syncthreads()
    result(tid) = shared_values(index)

end subroutine test1warp
[/code]

And… the explicitly synchronized version:
[code]
attributes(global) subroutine test1warpSync(input, result)
    integer, dimension(:) :: input, result
    integer, shared       :: shared_values(SHAREDSTRIDE * NUMWARPS1)
    integer               :: tid, warp, lane, index, sum

    tid = threadIdx%x
    warp = (tid-1)/WARP_SIZE
    lane = mod((tid-1), WARP_SIZE) + 1
    index = warp * SHAREDSTRIDE + lane + WARP_SIZE/2
    shared_values(index - 16) = 0
    call syncthreads() !yep this may be important
    sum = input(tid)
    shared_values(index) = sum
    !syncthreads to make sure everything is ok
    call syncthreads()
    !!!
    !manually unrolled loop to make sure everything is ok
    sum = sum + shared_values(index - 1)
    shared_values(index) = sum
    call syncthreads()
    sum = sum + shared_values(index - 2)
    shared_values(index) = sum
    call syncthreads()
    sum = sum + shared_values(index - 4)
    shared_values(index) = sum
    call syncthreads()
    sum = sum + shared_values(index - 8)
    shared_values(index) = sum
    call syncthreads()
    sum = sum + shared_values(index - 16)
    shared_values(index) = sum
    !!!
    !and then we write out
    call syncthreads()
    result(tid) = shared_values(index)

end subroutine test1warpSync
[/code]
You can launch the kernels with any input (an integer array); if you try it with 32 threads (1 warp) and 1 block you can easily see the difference.
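A host-side driver along these lines should reproduce the difference (the module name testmod is just a placeholder here, and it is assumed to define WARP_SIZE, SHAREDSTRIDE, NUMWARPS1 and the two kernels):

[code]
! Hedged sketch of a test driver: launch both kernels with one block
! of 32 threads and compare the results element by element.
program run_tests
    use cudafor
    use testmod             ! assumed module holding the two kernels
    implicit none
    integer, parameter :: n = 32
    integer            :: h_in(n), h_out1(n), h_out2(n), i
    integer, device    :: d_in(n), d_out(n)

    h_in = [(i, i = 1, n)]
    d_in = h_in

    call test1warp<<<1, n>>>(d_in, d_out)
    h_out1 = d_out

    call test1warpSync<<<1, n>>>(d_in, d_out)
    h_out2 = d_out

    print *, 'max difference:', maxval(abs(h_out1 - h_out2))
end program run_tests
[/code]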

Vicente will try to send the code of the tests; he’ll be making some more tests during the weekend.

best regards,

We are now almost completely certain that the problem is with the “volatile” keyword… for some reason the compiler is not taking it very seriously.
After checking the generated GPU code, it seems that pointers are being used to access the shared memory; those pointers should probably be marked as “volatile” as well.

Just an idea…