Hi all,
Let's imagine the following situation: I have a set of n atoms and I would like to calculate, e.g., the forces between them. In classical C++ on the CPU I could write something like this:
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
    {
        if (i != j)
        {
            float f = /* some calculations */;
            F[i] = F[i] + f;
        }
    }
So the i-loop goes over all atoms and the j-loop goes over all neighbours of atom i, which means my for loops run n*n steps in total.
Now I want to use CUDA for this purpose. Let's imagine that n = 1 114 728, so I must use both blocks and threads, e.g. 46 447 blocks of 24 threads each. I've tried something like this (assume one dimension):
__global__ void force(const int *nr_of_atoms, const float *atom_coordinate_x, float *F)
{
    int index_1 = threadIdx.x + blockIdx.x * blockDim.x; // outer-loop index
    int index_2 = threadIdx.y + blockIdx.y * blockDim.y; // inner-loop index

    F[index_1] = F[index_1] + /* some calculation depending on
        atom_coordinate_x[index_1] and atom_coordinate_x[index_2] */;
}
F is the array where I would like to collect the results of my calculations.
The variable index_1 represents the outer for loop of the classical CPU code (the loop which goes over each atom in the system).
The variable index_2 represents the inner for loop (the loop which goes over all neighbours of the selected atom).
Of course this approach fails, because I want not only to read data from the array F, but also to have each thread write data there. So the problem is: how do I communicate between threads in different blocks?
To explain it more clearly, let's assume that F is a zero vector at the beginning, so F = [0 0 0 …]. If each interaction between two atoms in the system is defined as "1", I expect the result of my CUDA code to be F = [n-1 n-1 n-1 …]. However, the code currently returns F = [1 1 1 …]. There is no communication between the threads, so I only see the result of whichever thread happened to write "last".
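One pattern I have been considering (just a sketch, not tested) sidesteps the write conflict entirely: give each atom i exactly one thread, and let that thread run the j loop serially in a private accumulator, so no two threads ever write the same F entry:

```cuda
// One thread per atom: thread "i" owns F[i] and is its only writer.
// With the unit interaction, F[i] should come out as n - 1.
__global__ void force(int n, const float *atom_coordinate_x, float *F)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;                      // guard: grid may overshoot n

    float f = 0.0f;                  // private accumulator, no races
    for (int j = 0; j < n; j++)
        if (i != j)
            f += 1.0f;               // real force from atom_coordinate_x[i]
                                     // and atom_coordinate_x[j] goes here
    F[i] = f;                        // single write per entry
}

// example launch: force<<<(n + 255) / 256, 256>>>(n, d_x, d_F);
```

The trade-off is that the inner loop is serial within each thread, but with over a million threads in flight there is still plenty of parallelism.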
There is a mechanism called shared memory, but it only allows communication between threads within one block, so it is still not enough.
Have you got any ideas how to improve my code? I don't expect full code; some pointers on where or what I should read about will be enough. E.g. I've read that neither atomic operations nor streams solve my problem (they concern different CUDA capabilities).
Thanks in advance.
Lukas