CUDA Fortran syncthreads_* functions

Hello,

I’m using PGI Fortran 10.9 and I’m trying to make use of the syncthreads_and function. I read that this and other syncthreads functions were implemented in version 10.6 [1].

Here’s the call I’m using:

integer, device, intent( out )				:: valid

integer					:: my_valid

...

valid = syncthreads_and( my_valid )

which is within an attributes( global ) CUDA kernel. This produces the error:

Could not resolve generic procedure syncthreads_and

I do have a ‘use cudafor’ at the beginning of the module. Is anyone able to give me an example of how to use this properly?

On a similar note, what happens with this kind of assignment when ‘valid’ is in device (global) memory? Will each thread try to write the outcome of syncthreads_and to this memory address, or does CUDA sort it out automatically?


[1] http://www.pgroup.com/support/release_2010.htm

Hi Martin,

The error occurs because a known compiler problem causes a type mismatch on “my_valid”. The workaround is to avoid the F90 “::” syntax when declaring “my_valid”.

For example:

% cat testand.CUF 

module testand

contains

attributes(device) subroutine test (valid)

integer, device, intent( out )            :: valid 
#ifdef WORKS
integer my_valid
#else
integer :: my_valid
#endif

valid = syncthreads_and(my_valid)

end subroutine test

end module testand
% pgf90 -V10.6 -c testand.CUF -Mcuda=cuda3.0
PGF90-S-0155-Could not resolve generic procedure syncthreads_and (testand.CUF: 15)
  0 inform,   0 warnings,   1 severes, 0 fatal for test
% pgf90 -V10.6 -c testand.CUF -Mcuda=cuda3.0 -DWORKS
%

Sorry for the inconvenience,
Mat

Ah, I see. Thanks very much Mat. I think I’ve come across that bug on the forum before, but for some reason I thought it was the other way around (i.e. that the ‘::’ notation works and the bug occurs without it).

Great, so I’m able to compile, but I’m getting unexpected behaviour from the syncthreads_and call. Here is the subroutine:

! Check the minimum separation between atoms in the cell
attributes( global ) subroutine check_minsep( N, atoms, minsep, valid )

	use cudafor
	use bcUtils

	! Input parameters
	integer, value, intent( in )				:: N
	real( kind_wp ), dimension( N, 3 ), device, intent( in )	:: atoms
	real( kind_wp ), value, intent( in )		:: minsep
	integer, device, intent( out )				:: valid

	!! Subroutine variables
	integer					:: i, j, i_max, j_max
	real( kind_wp )			:: dx, dy, dz, sep_sq, max_sep_sq
	real( kind_wp )			:: a_num
	integer					:: a_from, a_to
	integer					my_valid		! Is this thread's portion of the cell valid?

	max_sep_sq = minsep * minsep

	! Calculate atom numbers that will be done by this thread
	a_num	= N / blockdim%x
	a_from	= a_num * ( threadidx%x - 1 ) + 1
	a_to	= min( a_num * threadidx%x, N )

	! Assume valid unless test has failed
	my_valid	= 1
	valid		= 1

	outer: do i = a_from, a_to
		do j = 1, N
			if( i .ne. j ) then
				dx = atoms( i, 1 ) - atoms( j, 1 )
				dy = atoms( i, 2 ) - atoms( j, 2 )
				dz = atoms( i, 3 ) - atoms( j, 3 )

				sep_sq = dx * dx + dy * dy + dz * dz
				if( sep_sq .le. max_sep_sq ) then
					my_valid = 0
					exit outer
				end if
			endif

		end do
	end do outer

#ifdef SANITY_CHECK
       my_valid = 0
#endif

	valid = syncthreads_and( my_valid )

#ifdef ONE_THREAD_TEST
       valid = my_valid
#endif

end subroutine check_minsep

As you can see I’m checking that no atoms are closer than a given diameter.

Here’s the behaviour that I get with the various flag settings when I check the value of ‘valid’ on the host once check_minsep has returned:

– No flags set –
valid = 1 always

– SANITY_CHECK set –
valid = 1 always

– ONE_THREAD_TEST set (kernel called with <<< 1, 1 >>>) –
valid = my_valid (as expected)

Any ideas?

Again, thanks for your help.

-Martin

Hi Martin,

Are you trying to ensure that all atoms are valid? “syncthreads” and its derivatives only work within a single thread block. Hence, the value of “valid” will depend upon which thread block executed last. Currently, global synchronization in CUDA is not supported.

What you’ll need to do is create an array to hold the “valid” value for each thread. Once the kernel is finished, you’ll then perform a reduction operation on the “valid” array.
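Something like this, as a minimal sketch (untested; the names are hypothetical):

! Each thread writes its own flag; the host reduces the array afterwards.
attributes( global ) subroutine check_flags( n, flags )

    integer, value, intent( in )   :: n
    integer, device, intent( out ) :: flags( n )
    integer tid, my_valid

    ! Global thread id: one flag slot per thread
    tid = ( blockidx%x - 1 ) * blockdim%x + threadidx%x

    my_valid = 1
    ! ... per-thread test sets my_valid to 0 on failure ...

    if( tid .le. n ) flags( tid ) = my_valid

end subroutine check_flags

Then on the host:

flags_host = flags_dev           ! implicit device-to-host copy
valid = minval( flags_host )     ! 1 only if every thread reported 1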

Hope this helps,
Mat

P.S.

Using a scalar to return a value from a kernel, like you’re doing with ‘valid’ here, may be problematic. Every thread will be updating the same location in global memory. The final value will be non-deterministic since it will depend upon whichever thread updated it last.

Hi Mat,

thanks for getting back to me.

Yes, I’m trying to make sure that no atoms are closer than a given minimum separation. I’m aware that global sync isn’t supported; I should have said that the kernel above will only ever be executed by a single thread block (i.e. <<< 1, x >>>). So I divvy up the range of atoms between the threads in a single thread block and consider the whole system valid if, and only if, every thread reports its section to be valid.

Just to clarify:

What you’ll need to do is create an array to hold the “valid” value for each thread. Once the kernel is finished, you’ll then perform a reduction operation on the “valid” array.

Presumably syncthreads_and (and the others, or/count etc.) can essentially do a reduction of sorts within a thread block (which is all I need).

Perhaps it might help me most if you were able to provide a quick example of using syncthreads_and within a thread block.

Regards,
-Martin

Ah, now here’s something I didn’t realise:

Devices of compute capability 2.x support three variations of __syncthreads() described below [1] (count, and, or).

I’m using a CC 1.1 card. I’ve now added cc11 to the -Mcuda command and, rightly so, it won’t compile.

I used to use the sitenvrc file to set this and the lib directory but I recently updated to 10.9 which doesn’t seem to use that file. What is the standard practice now (other than setting the flag at compile time)?

Just out of curiosity, once I get a CC 2.0 card would the following code do what I was describing above:

integer, device, intent( out ) :: valid
integer       my_valid, our_valid
...
our_valid = syncthreads_and( my_valid )

if( mod( blockdim%x, threadidx%x ) .eq. 0 ) then
    valid = our_valid
end if

So all threads receive the ‘reduced’ our_valid and then one (within the block) writes it to valid. Again, this is all being performed using a grid of one block.

Presumably even within one block the operation:

valid = syncthreads_and( my_valid )

is not necessarily safe, as all threads are writing to the same address… that said, they will all be writing the same value.

Cheers,
-Martin

[1] CUDA C Programming Guide

I’m using a CC 1.1 card. I’ve now added cc11 to the -Mcuda command and, rightly so, it won’t compile.

And you’ll need to compile using the CUDA 3.0 or 3.1 libraries (-Mcuda=cuda3.1).

I used to use the sitenvrc file to set this and the lib directory but I recently updated to 10.9 which doesn’t seem to use that file. What is the standard practice now (other than setting the flag at compile time)?

sitenvrc has been deprecated. Currently, the compiler generates a single binary containing multiple versions of your code, one for each compute capability that can support it. For example, a single-precision code will have CC 2.0, CC 1.3, and CC 1.0 support in the same binary, while a double-precision code will only have CC 2.0 and CC 1.3. The CCs generated do not depend upon the build system’s installed device.
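If you do want to restrict the targets yourself, you can still do it at compile time. For example, reusing the test file from above (a sketch; here the cc20 suboption selects only the CC 2.0 target):

% pgf90 -c testand.CUF -Mcuda=cuda3.1,cc20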

So all threads receive the ‘reduced’ our_valid and then one (within the block) writes it to valid. Again, this is all being performed using a grid of one block.

Yes, I think it will work, but limiting your program to a single block seems, well, limiting (and slow). Why not have valid be an array with one element per block?

integer, device, intent( out ) :: valid( * )   ! one element per block
integer       my_valid, our_valid
...
our_valid = syncthreads_and( my_valid )

! no need to use mod, thread ids are per block
if( threadidx%x  .eq. 1 ) then
    valid(blockidx%x) = our_valid
end if
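
Then on the host, a quick sketch of gathering the per-block results (hypothetical names, with nblocks blocks of nthreads threads each):

integer, device :: valid_dev( nblocks )
integer         :: valid_host( nblocks )
...
call check_minsep<<< nblocks, nthreads >>>( N, atoms_dev, minsep, valid_dev )
valid_host = valid_dev                ! implicit device-to-host copy
all_valid  = minval( valid_host )     ! 1 only if every block was valid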
- Mat