How to fill a very large array randomly using CUDA

Hi.

I have a general CUDA problem - if anyone can help me find a solution I will be more than happy to reference you in any papers that result (you’ll be helping the UK Fusion energy programme)…

Here is what I am stuck with. I have a very large multi-dimensional array (a particle distribution function with 2 spatial coordinates and 3 velocity-space coordinates). The array is about 200MB in size, and hence fits into device memory on a C1060 without any problem.

I then wish to follow a large number of particles (~10e6) in space and time, and augment that array with the time spent in each cell, summed over particles. The result is a particle distribution function.

On the face of it, CUDA seems like a good way of doing this - the problem is effectively SIMD, and the particles do not interact with each other. When I do this on a typical supercomputer, I typically use MPI. Each node has a good amount of memory, so I can hold a temporary array of the same dimensionality as my distribution function on each node, then simply sum them all up at the end.

With CUDA though, I have a problem. If you look at the "toy problem" code below, which illustrates the issue, you can easily see why it fails, giving an answer of 10 instead of the desired 160. I've stripped out some OpenMP that was there to split the problem over multiple cards, so sorry if it's not as compact an example as it could be.

Each CUDA thread reads the contents of Fd(1) at the same time and then writes back the value Fd(1)+1.0 - obviously a two-step operation - at each of the 10 iterations of the loop. What we really want is for each thread to complete this operation one after another; for the example shown here, that would augment Fd(1) by 16.0 at each of the ten loop iterations.

For my real code (not the toy example below), the probability of two particles sharing a grid cell at any one iteration is lower than in this example, but it nonetheless results in what one might call a "thread conflict numerical error".

The obvious way to solve this would be to give Fd an extra dimension, equal in size to the total number of threads, allowing one to fill the array indexed by thread number as well as the spatial and velocity-space indices, then integrate over that dimension at the end (analogous to what one usually does with MPI). Unfortunately, a sensible number of threads multiplied by my 200MB distribution function (even 128 threads would need over 25GB) is far beyond the device memory available on a C1060.

My question then is this - is there some established way of locking down the operation Fd(1) = Fd(1) + 1.0, so that individual threads have to carry this operation out one after another, hence removing the “read-write” conflict between threads? Or is there some other way of filling a very large array in a random fashion so that individual threads don’t conflict? Or, is this type of problem currently intractable using CUDA?

Any advice will be greatly appreciated and acknowledged,

kind regards,

Rob.


module process_mod

use cudafor

implicit none

contains

attributes(global) subroutine process_kernel( Fd )

implicit none

real         :: Fd(1000)
real         :: RAN
integer      :: i,N

do i=1,10

!  Is there some way of locking down this next operation so that it
!  must be completed by thread n before thread n+1 tries to do the same?

   Fd(1)=Fd(1)+1.0
enddo

end

subroutine process( Fh )

   use cudafor

   implicit none

   real         :: Fh(1000)
   real, device :: Fd(1000)
   integer      :: iflag,idev

   Fh = 0.0
   Fd = 0.0

!  Call Kernel for 4x4 = 16 threads:

   call process_kernel<<<4,4>>>( Fd )

   Fh = Fd

end

end module

program memexample

   use cudafor
   use process_mod

   implicit none

   real         :: Fsum(1000),Fh(1000)
   real, device :: Fd  (1000)

   integer      :: i,iflag

   iflag = cudaSetDevice(0)

   call process(Fh)

   print*,  'Answer:', maxval(Fh)

end

Hi Rob,

My question then is this - is there some established way of locking down the operation Fd(1) = Fd(1) + 1.0, so that individual threads have to carry this operation out one after another, hence removing the “read-write” conflict between threads? Or is there some other way of filling a very large array in a random fashion so that individual threads don’t conflict? Or, is this type of problem currently intractable using CUDA?

I don't believe there is a good way of doing this. What you want is a mutex lock on a variable, and mutexes require atomic operations. While CUDA does have atomic operations, on your hardware the variables must be integers (floating-point atomics were added in CUDA 3.0 and require compute capability 2.0, i.e. Fermi). Atomics also force the conflicting threads to serialize and are very slow, so for large problems this is not a good solution.
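
If you do want to experiment with atomics, here is a minimal sketch of what that could look like in CUDA Fortran, assuming integer bins and a compiler release that supports the atomicadd intrinsic. The module name, array size and placeholder index calculation are mine, not from your code:

module count_mod

   use cudafor

   implicit none

   integer, device :: Nd(1000)   ! integer bins: 1.x hardware only has integer atomics

contains

   attributes(global) subroutine count_kernel()

      integer :: indx, istat

      ! In the real code indx would come from the particle's cell; here it is
      ! just derived from the thread number as a placeholder.
      indx = mod( (blockIdx%x-1)*blockDim%x + threadIdx%x - 1, 1000 ) + 1

      ! atomicadd performs the read-modify-write as one indivisible operation,
      ! so no updates are lost - but conflicting threads are serialized.
      istat = atomicadd( Nd(indx), 1 )

   end subroutine count_kernel

end module count_mod

On a C1060 (compute capability 1.3) this only works for integer data, and heavy contention on a single bin serializes badly, which is why I would lean towards the reduction approach below.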

What you could do is a reduction. Have each thread store its own value for the "Fd(1)+1.0" in a temporary array (you only need an array the size of the number of threads) and then do a final reduction in serial. To work around the large memory size, you will need to 'strip mine', i.e. only work on a portion of your problem at a time.

For example:

module process_mod

use cudafor

implicit none
real,device         :: Fd(1000)
real,device         :: FdL(16,100)

contains

attributes(global) subroutine process_kernel(N)

implicit none

real         :: RAN
integer, value :: N
integer      :: i,j,idx
idx  = (blockIdx%x-1)*blockDim%x + threadIdx%x

do i=1,N
   do j=1,10
      FdL(idx,i)=FdL(idx,i)+1
   enddo
enddo

end

attributes(global) subroutine process_kernel_sum(start,N)

implicit none

integer, value :: start,N
integer        :: i,j

do i=1,N
   do j=1,16
      Fd(start+i-1)=Fd(start+i-1)+FdL(j,i)
   enddo
enddo

end

subroutine process( Fh )

   use cudafor

   implicit none

   real         :: Fh(1000)
   integer      :: iflag,idev,i

   Fh = 0.0
   Fd = 0.0
   FdL = 0.0

!  Call Kernel for 4x4 = 16 threads:

   do i=1,1000,100
      FdL = 0.0
      call process_kernel<<<4,4>>>(100)
      call process_kernel_sum<<<1,1>>>(i,100)
   enddo
   Fh = Fd

end

end module

program memexample

   use cudafor
   use process_mod

   implicit none

   real         :: Fsum(1000),Fh(1000)

   integer      :: i,iflag

   iflag = cudaSetDevice(0)

   call process(Fh)

   print*,  'Answer:', maxval(Fh)
   print*, Fh(1), Fh(101), Fh(1000)

end

Hope this helps,
Mat

Hi Mat.

Thanks for this. I can’t quite see how this will work for my problem though (I was warned that CUDA can get messy).
Imagine if you will that the line in my code:

Fd(1)=Fd(1)+1.0

is now replaced by the following:

call random_number(RAN)
indx = int(RAN*1000.0) + 1
Fd(indx) = Fd(indx) + 1.0

I can’t see how to cast this into code that looks like yours - it’s the random nature of the way a Monte Carlo code works that’s giving me grief. Is the way around this to store an array of values of indx then sum up using your strip mine example?

Rob.

One other quick question - doesn’t the fact that you are only using 10 threads to do the summations make this algorithm slow? Do you need to call syncthreads inside process_kernel to make sure it’s completed before process_kernel_sum tries to do its work?

I did toy with the idea of, at every time step, passing back to the host a long list of particle cell locations (10e6 particles, 5 values each describing position in space and velocity space), then letting the host increment my distribution function whilst the GPU gets on with the next time step. I started along these lines only to discover a very large performance hit from placing the time-step loop outside rather than inside my GPU tracking kernel (presumably at each time step the GPU has to re-read all the particle states from global memory, whereas if the time loop is inside the kernel, the particle states only have to be read at the start). This probably doesn't make a lot of sense without seeing the code.

Would you be willing to have a look at the code itself? - it’s about 10 times longer than the example I’ve used here, but well documented and hopefully easy to read. Let me know - I fully appreciate that it’s not really your job to help customers with their code so no worries if you haven’t got time - thanks for all the help so far…

Rob.

Hi Rob,

I can’t see how to cast this into code that looks like yours - it’s the random nature of the way a Monte Carlo code works that’s giving me grief. Is the way around this to store an array of values of indx then sum up using your strip mine example?

I wrote an article for the PGInsider (our newsletter) which walks through a simple Monte Carlo code that might help you (See: Account Login | PGI). You can't call random_number from a device kernel, so you will need to pre-compute these values on the host; the article gives a method for doing this.

The article also shows how to perform a simple sum reduction. Your histogram will follow a similar form, except that instead of a single element in an array, each thread will need its own "bin".
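
To make that concrete, here is a rough sketch of how your random-index case might look, combining host-side pre-computed random numbers with a per-thread bin array and a serial summation kernel. The names (scatter_mod, NT, nran) and sizes are placeholders rather than code from your application, and for your real 200MB array you would combine this with the strip mining shown earlier:

module scatter_mod

   use cudafor

   implicit none

   integer, parameter        :: NT = 16       ! total number of threads (4x4 launch below)
   real, device              :: Fd(1000)      ! the distribution function
   real, device              :: FdL(NT,1000)  ! one private copy of Fd per thread
   real, device, allocatable :: RANd(:)       ! pre-computed random numbers

contains

   attributes(global) subroutine scatter_kernel( nran )

      integer, value :: nran
      integer        :: idx, i, indx

      idx = (blockIdx%x-1)*blockDim%x + threadIdx%x

      ! Each thread consumes its own slice of the random numbers and scatters
      ! into its own private column of FdL, so no two threads ever touch the
      ! same element.
      do i = idx, nran, NT
         indx = int(RANd(i)*1000.0) + 1
         FdL(idx,indx) = FdL(idx,indx) + 1.0
      enddo

   end subroutine scatter_kernel

   attributes(global) subroutine sum_kernel()

      integer :: i, j

      ! Serial reduction over the thread dimension (launched with 1 thread).
      do i = 1, 1000
         do j = 1, NT
            Fd(i) = Fd(i) + FdL(j,i)
         enddo
      enddo

   end subroutine sum_kernel

end module scatter_mod

program scatter_example

   use cudafor
   use scatter_mod

   implicit none

   integer, parameter :: nran = 100000
   real               :: RANh(nran), Fh(1000)
   integer            :: iflag

   iflag = cudaSetDevice(0)

   allocate( RANd(nran) )
   call random_number(RANh)      ! random numbers generated on the host...
   RANd = RANh                   ! ...then copied to the device

   Fd  = 0.0
   FdL = 0.0

   call scatter_kernel<<<4,4>>>( nran )
   call sum_kernel<<<1,1>>>()

   Fh = Fd
   print*, 'Total counts:', sum(Fh)   ! should equal nran

end program scatter_example

The key point is that the random index only selects which element of a thread's private column gets incremented, so the "read-write" conflict between threads never arises; the conflict is deferred to the serial summation over the thread dimension.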

One other quick question - doesn’t the fact that you are only using 10 threads to do the summations make this algorithm slow?

With any reduction, the code ultimately needs to have a serial portion and it will be slow. Though if done correctly, the serial portion will be very small with little overall performance impact.

Do you need to call syncthreads inside process_kernel to make sure it’s completed before process_kernel_sum tries to do its work?

No. The synchronization is implicit for kernels having the same stream.

Would you be willing to have a look at the code itself? - it’s about 10 times longer than the example I’ve used here, but well documented and hopefully easy to read. Let me know - I fully appreciate that it’s not really your job to help customers with their code so no worries if you haven’t got time - thanks for all the help so far…

First, why don't you see if my article helps any. If you're still having problems, then we can take a look at your code for a few minutes and try to send you in the right direction.

Note you might take a look at the PGI Accelerator Model as well. As my article shows, it does a great job with the kernel and reduction code. So if your code is dominated by the compute portion rather than by copying the data over to the GPU (like my little example), then it might be the best way to go. Also, we are working on supporting CUDA device data within Accelerator regions. Once available, this will remove the overhead of copying the random numbers to the GPU.
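
To give a flavour of the directive approach, a sum reduction in the Accelerator model looks roughly like the sketch below. This is only from memory of the !$acc region directive syntax of that era, not code from the article; the compiler detects the reduction itself:

subroutine acc_sum( a, n, total )

   implicit none

   integer, intent(in)  :: n
   real,    intent(in)  :: a(n)
   real,    intent(out) :: total
   integer              :: i

   total = 0.0

!$acc region
   do i = 1, n
      total = total + a(i)
   end do
!$acc end region

end subroutine acc_sum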

- Mat