Hi.

I have a general CUDA problem - if anyone can help me find a solution I will be more than happy to reference you in any papers that result (you’ll be helping the UK Fusion energy programme)…

Here is what I am stuck with. I have a very large multi-dimensional array (a particle distribution function - 2 spatial coordinates, and 3 velocity space coordinates). The array is about 200MB large, and hence fits into device memory onto a C1060 without any problem.

I then wish to follow a large number of particles (~10e6) in space and time, and augment that array with the time spent in each cell, summed over particles. The result is a particle distribution function.

On the face of it, CUDA seems like a good way of doing this - the problem is effectively SIMD, and the particles do not interact with eachother. Now, if I am doing this on a typical supercomputer, I typically use MPI. This means that each node has a good amount of memory. I can hold a temporary array, of the same dimensionality as my distribution function on each node, then simply sum them all up at the end.

With CUDA though, I have a problem. If you look at the “toy problem” code below, in order to illustrate the problem, you can easily see why it fails, giving an answer of 10 instead of the desired 160. I’ve stripped out some openMP that was there to split the problem over multiple cards, so sorry if it’s not as compact an example as it could be.

Each cuda thread is reading the contents of Fd(1) at the same time then writing to Fd(1) the value Fd(1)+1.0 - obviously a 2 step operation, at each of the 10 iterations of the loop. What we really want is for each thread to complete this operation one after another, for the example I’ve shown here, augmenting Fd(1) with a value of 16.0 for each of the ten loop iterations.

For my real code (not the toy example below), the probability of there being 2 particles in a cell grid at any one iteration is lower than the example here, but it nonetheless results in what one might call a “thread conflict numerical error”.

The obvious way to solve this, would be to give Fd an extra dimension, equal in size to the total number of threads, allowing one to fill the array indexed with thread number as well as spatial and velocity space indices, then integrate over that dimension at the end (analogous to what one usually does with MPI). Unfortunately, for a sensible number of threads, multiplied by my 200MB distribution function, a C1060 does not have enough device memory.

My question then is this - is there some established way of locking down the operation Fd(1) = Fd(1) + 1.0, so that individual threads have to carry this operation out one after another, hence removing the “read-write” conflict between threads? Or is there some other way of filling a very large array in a random fashion so that individual threads don’t conflict? Or, is this type of problem currently intractable using CUDA?

Any advice will be greatly appreciated and acknowledged,

kind regards,

Rob.

```
module process_mod
use cudafor
implicit none
contains
attributes(global) subroutine process_kernel( Fd )
implicit none
real :: Fd(1000)
real :: RAN
integer :: i,N
do i=1,10
! Is there some way of locking down this next operation so that it
! must be completed by thread n before thread n+1 tries to do the same?
Fd(1)=Fd(1)+1.0
enddo
end
subroutine process( Fh )
use cudafor
implicit none
real :: Fh(1000)
real, device :: Fd(1000)
integer :: iflag,idev
Fh = 0.0
Fd = 0.0
! Call Kernel for 4x4 = 16 threads:
call process_kernel<<<4,4>>>( Fd )
Fh = Fd
end
end module
program memexample
use cudafor
use process_mod
implicit none
real :: Fsum(1000),Fh(1000)
real, device :: Fd (1000)
integer :: i,iflag
iflag = cudaSetDevice(0)
call process(Fh)
print*, 'Answer:', maxval(Fh)
end
```