how to implement double for loops in CUDA

Hi all,

Lets imagine such situation: I have a set of n atoms and I would like to calculate e.g. forces between them. In classical C++ CPU I could write sth like that:

for (i=0 ; i<n ; i++)

    for (j=0 ; j<n ; j++)

    { 

          if (i!=j)

          {

              F = //some calculations

              F[i] = F[i] + F;

          }

    }

So i-loop goes for all atoms and j-loop goes for all atom i neighboars. So I need to run n*n steps in my for loops.

No i want to use CUDA for this purpose. Lets imagina that n = 1 114 728, so I must use both grids and threads, eg: nr of grids 46 447, nr of threads 24. I’ve tried to do sth like that (assume 1 dimension):

__global__ void force(const int *nr_of_atoms, float *atom_coordinate_x, float **F, int *grids_nr, int *threads_nr)

{

       int i = blockIdx.y ;

       int j = threadIdx.y ; 

int k = blockIdx.x ;

       int l = threadIdx.x ;

int index_1, index_2;

index_1 = threadIdx.x + blockIdx.x * blockDim.x ;

       index_2 = threadIdx.y + blockIdx.y * blockDim.y ;

F[index_1] = F[index_1] + //some calculations depends on atom_coordinate_x[index_1] and atom_coordinate_x[index_2]

}

F is the place I would like to carry the results of my calculations.

variable index_1 represent the outer for loop in classical CPU code (loop which goes for each atoms in the system).

variable index_2 represent the inner for loop in classical CPU code (loop which goes for all neighbour of selected atoms).

Of course this way fails due to the fact that I want not only to read data from array F, but I also want each tread to write there some data. So the problem is how to communicate between threads in different grids?

To explain it more clearly lets assume that F is a 0-vector at the begining, so F = [0 0 0 … ]. If each interaction between two atoms in the system id defined as “1”, I will expect that the result of my CUDA code should by F = [n-1 n-1 n-1 …]. However now the code returns F=[1 1 1 …]. There is no communication between threads, so I can see only a single result of performance of the “last” thread.

The is a solution named shared memory, but it allows me to make communication between threads in one grid, it is still not enough.

Have you got any ideas how to improve my code. I dont expect full code. Some ideas where or what I should read about will be enough. E.g. I’ve read that neither atomic operations nor streams not solve my problems (they are threat about different CUDA’s possibilities).

Thanks in advance.

Lukas

Check sdk example nbody, it does just this task.

Your cde will not work without tomic operations. An those will be slow. The problem is that each thread loads F[index_1] and then changes and returns it. The result will depend on the order the theads are executed

What I would do: Save the data to a matrix F(i,j) and then calculate the forces by doing the sum over the columns or lines. For the sum you need to do a reduction algorithm. In C would look like this for 1 array of size Np:

global void vecsum(double *dev_out,double *dev_uuu,const int Np)
{
shared double temp[bsl];
int ind= threadIdx.x;
int iii=ind;
double uuu=0;
while(iii<Np)
{
uuu=uuu+dev_uuu[iii];
iii=iii+blockDim.x;
}
temp[ind]=uuu;
__syncthreads();

for(int offset=bsl/2;offset>=1;offset=offset/2)
{
if(ind<offset)
{
temp[ind]=temp[ind]+temp[ind+offset];
}

__syncthreads();

}
if(ind==0)
{
dev_out[0]=temp[0];
}
}

This code would run with only one block of size bsl (<<<1,bsl>>>). I use a shared variable the size of the block. First I collect all the partial sums and then I use a reduction to go from bsl to 1 element. Yo can submit this also with more blocks and each bloc would do te summation for 1 particle, though this will be slow because of the synchronizing. Better to submit 1 block for 1 articles and use enough streams to fill the gpu with summations. For a Tesla card you would need 14 streams ( 1 per each SMP) to fill the gpu.

I’ve tried this way. But I am very new to CUDA, so i completely did not understand this code. I have even failed with recognizing this part of code which makes the calculations.

I’ ve tried to initialise F[n][n] array to store in the F[i][j] the partial interaction of atom j to atom i. But I failed with initialize (dynamicaly) array having 1 000 000 x 1 000 000 elements.

Then I tried to use CUDA only to the outer loop. So I would like to have one for loop inside the kernel. But I manage to have only ~2000 for intercations inside the kernel. If more, my graphics driver fails and breaks the program and the results are lost. I have card GF GT 325M in my laptop. I think this way is very similar to your proposition. However I will try to check your proposition.

Hello,

I am working on a code now in which I do only the inner loop on the gpu. In this case you only have to define 2 array with N elements, but you still have to do the sum on the gpu. In this case you need more than 500 particles to fill the gpu. Also you should play around with the number of threads per block. I found for my case on Tesla 512 was optimal.

So you code would look like:

for (i=0 ; i<n ; i++)
{
calculate forces on the gpu for and the result would be stored in a an array Fij[1:N];
now make a sum on the gpu and store the result in an array F[j];
}

this sum can be implemented in 2 stages or 1 similar to the code I gave before. I am curious what will come out, as I am working on similar problem.

You have to read programming guide and learn cuda first. If you could not get sdk example. And it could be pointless to calculate one line of matrix on gpu cause of transfer cost. You should have very complicated interaction function in order to get this effective.

I have to disagree with you. I doing it and over 7000 particles the gpu (Tesla), gets filled ( 20 times speed up). Below 7000 I use streams to run 2 or more copies which gives me more measurements. I do not know all the details of the poster, but he has more than 1000000 particles, though there will be 1000000 calls as well for each time step.

What was your function you compute for each particle?

I have a Monte Carlo problem. I compute the change in energy for 1 jump. I use the minimum image convention each particle interacts with each particle. Two Coulomb interaction (1/r) and one Yukawa(exp(-lambda*r)/r with rcut=10/lambda and lambda=54). I set the number of particles multiple of 512. For 3584 I get speed up 12 times compared to single Xeon 3450, but in this case only half of the gpu is filled so I run 2 copies to get more measurements, so in total I get 25 time more measurements than I would get on a single core run.

I suppose your cpu code was far from optimal.

There is not much to optimize when there is long range interaction. For each jump there are N interactions to be computed. If you have some general guide lines for optimize the cpu code or a fast code I can test against , I am happy to try.

Hello,

I think that one the prvious posters suggested something to look at hits paper:

http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CCQQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.156.7082%26rep%3Drep1%26type%3Dpdf&ei=WivrTq65Go_64QTY6dCHCQ&usg=AFQjCNG5ptd9sGsVGMyokNHHzg0A6cD74w

It might help you.

Ok, I’ve done some investigations:

  1. parallelization of the inner loop;

  2. simple CPU version;

  3. tile version (as was suggested in the article which was adviced me in the post above

in the 2-loop problem

Now the code of the programs and some results.

  1. Not optimized version of CUDA code - only 1 for-loop was sent to GPU:
#include "cuda_runtime.h"

#include "device_launch_parameters.h"

#include <stdio.h>

#include <iostream>

const int threadsPerBlock = 512;

using namespace std;

__global__ void sila(const int *kn, int *kth, float *kF, float *ka, int *ki)

{

    int j;

	__shared__ float cache[threadsPerBlock];

	int cacheIndex = threadIdx.x;

	cache[cacheIndex] = 0.0;  

if (cacheIndex < *kth)

{

	for (j= cacheIndex ; j<*kn ; j=j+*kth)

	{

        if(j!=*ki)

		{

            cache[cacheIndex] = cache[cacheIndex] + kF[j];

			

			

		}

		

	}

	

}

	__syncthreads();

	//reduction algorithm

	

	int i = blockDim.x / 2;

	while (i != 0)

	{

		if(cacheIndex < i)

		{

			cache [ cacheIndex ] +=  cache [ cacheIndex + i];

		}

		__syncthreads();

		i /= 2;

	}

	__syncthreads();

	if(cacheIndex == 0)

	ka[*ki] = cache[0];

}

int main()

{

	FILE *fp;

	fp=fopen("dane.dat","w");

	int n_thread = 512;

	int *dev_th, *dev_n, *dev_i;

	float *dev_F, *dev_a;

	int n=100000;

	int i,j;

	float *F  = new float[n];

	float *a  = new float[n];

	time_t t1,t2;

	t1 = time(NULL);

	for(i=0;i<n;i++)

	{

		F[i]=1.0;

		a[i]=0.0;

	}

	cudaMalloc( (void**)&dev_th, sizeof(int) ) ;

	cudaMalloc( (void**)&dev_F, n * sizeof(float) ) ;

	cudaMalloc( (void**)&dev_a, n * sizeof(float) ) ;

	cudaMalloc( (void**)&dev_n, sizeof(int) ) ;

	cudaMalloc( (void**)&dev_i, sizeof(int) ) ;

	cudaMemcpy ( dev_F, F, n * sizeof(float), cudaMemcpyHostToDevice) ;

	

	cudaMemcpy ( dev_th, &n_thread, sizeof(int), cudaMemcpyHostToDevice) ;

	cudaMemcpy ( dev_n, &n, sizeof(int), cudaMemcpyHostToDevice) ;

	

	for(i=0;i<n;i++)

	{

	

		cudaMemcpy ( dev_i, &i, sizeof(int), cudaMemcpyHostToDevice) ;

		cudaMemcpy ( dev_a, a, n * sizeof(float), cudaMemcpyHostToDevice) ;

		sila<<<1,n_thread>>>(dev_n,dev_th,dev_F,dev_a,dev_i);

		cudaMemcpy ( a, dev_a,n * sizeof(float), cudaMemcpyDeviceToHost) ;

	}

	

	for(i=0;i<n;i++)

	{

		

fprintf(fp,"%f ",a[i]);

		

         fprintf(fp,"\n");

	}

	fclose(fp);

	cudaFree(dev_a);

	cudaFree(dev_F);

	cudaFree(dev_n);

	cudaFree(dev_th);

	cudaFree(dev_i);

	delete(F);

	t2 = time(NULL); 

    cout<<"czas: "<<t2-t1<<endl; 

return 0;

}

In the code cited above both the number of threads and size of the problem (n) can be changed.

When the nr of threads used for calculation was changed the speed of the calculation changes like this: results

So for 512 threads the speed was quite nice, but how it changes with the size of the system (number of atoms)> Like this:

n time [s] (n_threads=512)

10000 6
100000 168
200000 603
300000 1289

I decided not to check the size of n=1114728 (size of my current system). I have notes (I am not sure if correct) with fitting the exp function. This estimation told me that I need ~2 days to calculate one double for loop of such system. To long.

Than I checked (2.) simple CPU code (not multithreaded, just calculate on single core)

// test_for.cpp : Defines the entry point for the console application.

//

//#include "stdafx.h"

#include <stdio.h>

#include <iostream>

#include <time.h>

using namespace std;

int main()

{

    FILE *fp;

	int n = 100000;

	int i,j;

	float *F = new float[n];

	float *a = new float[n];

	fp = fopen("dane.dat","w");

		time_t t1,t2;

		t1 = time(NULL);

	for(i=0;i<n;i++)

	{

		F[i]=1.0;

		a[i]=0.0;

	}

	for(i=0;i<n;i++)

	{

		for(j=0;j<n;j++)

		{

			if(i!=j)

				a[i] = a[i] + 1.0;

		}

	}

	for(i=0;i<n;i++)

	{

		

fprintf(fp,"%f ",a[i]);

		

         fprintf(fp,"\n");

	}

	fclose(fp);

	delete(F);

	delete(a);

	t2 = time(NULL); 

	cout<<"czas: "<<t2-t1<<endl; 

	return 0;

}

This code is faster than previously shown CUDA code.

Finally, I’ve tried to write sth similar to the code suggested in “N body problem” in the NVidia materials (link is some posts above).

The idea of this code is to read data in a tile of size p. So if I have N atoms, I make N/p blocks and each block has p threads. Each thread reads p elements of array from global memory, do the calculations, write data and read another portion of data. What is strange, in the code cited below I was able to set no more than p=32. If more program got wrong results (or comletely wrong like sth.e+28). I dont understand why.

#include "cuda_runtime.h"

#include "device_launch_parameters.h"

#include <stdio.h>

#include <iostream>

const int threadsPerBlock = 32; // !! that is where I can change tile size

#define N 300000

using namespace std;

__global__ void tile_p(float *kF, float *kA, int *kl)

{

	int j;

	__shared__ float cache[threadsPerBlock];

	int cacheIndex = threadIdx.x;

	int blockIndex = blockIdx.x * *kl + threadIdx.x;

	cache[cacheIndex] = 0.0;

	for(j=0;j<*kl;j++)

		cache[cacheIndex] += kF[j];

//	__syncthreads();

	kA[blockIndex]+=cache[cacheIndex];

	__syncthreads();

}

int main()

{

	FILE *fp;

	fp=fopen("dane.dat","w");

	float *F  = new float[N];

	float *A  = new float[N];

	int i,n=threadsPerBlock;

	float *dev_F,*dev_A;

	int *dev_tile;

	int N_p,p=threadsPerBlock;

	int tile_size=threadsPerBlock;

	time_t t1,t2;

	t1 = time(NULL);

	N_p = int(N/p) +1;

	for(i=0;i<N;i++)

	{

		F[i]=1;

		A[i]=0;

	}

	cudaMalloc( (void**)&dev_tile, sizeof(int) ) ;

	cudaMemcpy ( dev_tile, &tile_size, sizeof(int), cudaMemcpyHostToDevice) ;

	cudaMalloc( (void**)&dev_F, n * sizeof(float) ) ;

	cudaMalloc( (void**)&dev_A, N * sizeof(float) ) ;

	cudaMemcpy ( dev_A, A, N * sizeof(float), cudaMemcpyHostToDevice) ;

	for(i=0;i<N;i=i+tile_size)

	{

	cudaMemcpy ( dev_F, &F[i], n * sizeof(float), cudaMemcpyHostToDevice) ;

	

	tile_p<<<N_p,p>>>(dev_F,dev_A,dev_tile);

	

	

	}

	cudaMemcpy ( A, dev_A, N * sizeof(float), cudaMemcpyDeviceToHost) ;

	

	for(i=0;i<N;i++)

	{

		

fprintf(fp,"%f ",A[i]);

		

         fprintf(fp,"\n");

	}

	fclose(fp);

	t2 = time(NULL); 

    cout<<"czas: "<<t2-t1<<endl; 

	cudaFree(dev_F);

	cudaFree(dev_A);

	cudaFree(dev_tile);

	delete(F);

	delete(A);

	return 0;

}

The speed of such code is quite nice, faster than CPU code. See picture: speed_comparison.

Ok, now the question is “can I make my code better”, achieving better speed? I am not sure if I use “data reuse” and “loop unrolling”, because I think I dont understand the idea behing this two terms. I am not sure if using more than one thread for calculation single atom force will improve the results. And of course I use only 32 threads from 512 I have for calculation. I really dont understand what is going wrong when I increase the tile size (the same as number of threads involving in calculations). Probably there is a problem with shared memory, but I must spend some time to realize what is wrong exactly.

I guess that it must be some additional way to improve the code, because the improve of the speed should be about 100, not 3 as I got compared to simple CPU code. Even considered my relatively poor hardware (personal laptop, not a cloud computing).

Hello,

I am working on a similar code . Maybe the #pragma unroll in front of for loop in the tile_p kernel might help. You can try tiles which are not square pxm tile.

Sth like that?

#pragma unroll 4   

	for(j=0;j<*kl;j++)

		cache[cacheIndex] += kF[j];

I didn’t see any improvement. I dont know what number insert in pragma directive? Now I will try with longer tile-frame.

Ok, I managed to get some positive effect but it seems that it was not related to use pragma:

Originally I have:

for(j=0;j<*kl;j++)

cache[cacheIndex] += kF[j];

kl is the tile length.

for n=300 000 time is 133s.

Than:

for(j=0;j<32;j++)

cache[cacheIndex] += kF[j];

gives time = 94s.,

and finally sth like that:

cache[cacheIndex] += kF[0];

cache[cacheIndex] += kF[1];

cache[cacheIndex] += kF[2];

cache[cacheIndex] += kF[3];

cache[cacheIndex] += kF[4];

cache[cacheIndex] += kF[5];

cache[cacheIndex] += kF[6];

cache[cacheIndex] += kF[7];

cache[cacheIndex] += kF[8];

cache[cacheIndex] += kF[9];

cache[cacheIndex] += kF[0];

cache[cacheIndex] += kF[10];

cache[cacheIndex] += kF[11];

cache[cacheIndex] += kF[12];

cache[cacheIndex] += kF[13];

cache[cacheIndex] += kF[14];

cache[cacheIndex] += kF[15];

cache[cacheIndex] += kF[16];

cache[cacheIndex] += kF[17];

cache[cacheIndex] += kF[18];

cache[cacheIndex] += kF[19];

cache[cacheIndex] += kF[20];

cache[cacheIndex] += kF[21];

cache[cacheIndex] += kF[22];

cache[cacheIndex] += kF[23];

cache[cacheIndex] += kF[24];

cache[cacheIndex] += kF[25];

cache[cacheIndex] += kF[26];

cache[cacheIndex] += kF[27];

cache[cacheIndex] += kF[28];

cache[cacheIndex] += kF[29];

cache[cacheIndex] += kF[30];

cache[cacheIndex] += kF[31];

gives 88s.

Using pragma or not doesn’t change the time of calculations.