Optimising CUDA code for Lennard-Jones potential calculation

I have a grid of 30x30x30 periodic cells for which I need to calculate the Lennard-Jones force using CUDA. A global function calls this device function, supplying it with the array of per-point data structures and the (x, y, z) indices of the point to process.

// Accumulates the Lennard-Jones forces acting on cell (idx, idy, idz) from all
// neighbouring cells within the cutoff radius, with periodic boundary wrapping.
__device__ void calculateForce(point a[][DATAYSIZE][DATAZSIZE], int idx, int idy, int idz){
    // Clear the force accumulators for both particle types
    a[idx][idy][idz].force_a[0] = 0;
    a[idx][idy][idz].force_a[1] = 0;
    a[idx][idy][idz].force_a[2] = 0;

    a[idx][idy][idz].force_b[0] = 0;
    a[idx][idy][idz].force_b[1] = 0;
    a[idx][idy][idz].force_b[2] = 0;

    int dx, dy, dz;
    float fax = 0, fay = 0, faz = 0;
    float fbx = 0, fby = 0, fbz = 0;

    float e1 =  24 * epsilon_a/da;
    float e2 =  24 * eab/dab;
    float e3 =  24 * epsilon_b/db;
    // Visit every cell in the (2*CUTOFF)^3 cube centred on (idx, idy, idz)
    for (int i = idx-CUTOFF; i < idx+CUTOFF; i++){
        for(int j = idy-CUTOFF; j < idy+CUTOFF; j++){
            for(int k = idz-CUTOFF; k < idz+CUTOFF; k++){
                dx = idx - i;
                dy = idy - j;
                dz = idz - k;
                float distance = sqrtf(dx*dx+dy*dy+dz*dz);
                // Skip neighbours outside the spherical cutoff
                if(distance > CUTOFF){
                    continue;
                }
                // Wrap the neighbour indices back into the periodic box
                int x=i,y=j,z=k;
                if(x>DATAXSIZE-1){
                    x -= DATAXSIZE;
                }else if(x<0){
                    x += DATAXSIZE;
                }
                if(y>DATAYSIZE-1){
                    y -= DATAYSIZE;
                }else if(y<0){
                    y += DATAYSIZE;
                }
                if(z>DATAZSIZE-1){
                    z -= DATAZSIZE;
                }else if(z<0){
                    z += DATAZSIZE;
                }

                float sqa = da/distance;
                float sqab = dab/distance;
                float sqb = db/distance;

                // Pair force magnitudes: (24*eps/sigma) * (2*(sigma/r)^13 - (sigma/r)^7)
                float faa = e1 * (2 * powf(sqa, 13) - powf(sqa, 7));
                float fab = e2 *(2 * powf(sqab, 13) - powf(sqab, 7)) ;
                float fbb = e3 * (2 * powf(sqb, 13) - powf(sqb, 7));

                float na = (a[x][y][z].na * faa + a[x][y][z].nb * fab)/distance;
                float nb = (a[x][y][z].na * fab + a[x][y][z].nb * fbb)/distance;

                fax += na * dx;
                fay += na * dy;
                faz += na * dz;

                fbx += nb * dx;
                fby += nb * dy;
                fbz += nb * dz;
            }
        }
    }
    // Write back the accumulated force components for this cell
    a[idx][idy][idz].force_a[0] = fax;
    a[idx][idy][idz].force_a[1] = fay;
    a[idx][idy][idz].force_a[2] = faz;

    a[idx][idy][idz].force_b[0] = fbx;
    a[idx][idy][idz].force_b[1] = fby;
    a[idx][idy][idz].force_b[2] = fbz;
}

For a cube of side length (2 x CUTOFF), I calculate the distance of each cell from the centre of the cube. If the distance is less than the CUTOFF radius, I calculate the Lennard-Jones force. I have already minimised repetitive interactions in this code.

But here, the addition-assignment operations at the end of the inner loop (the fax/fay/faz and fbx/fby/fbz accumulations) seem to be an extreme bottleneck. Is it possible to optimise this further?

Instead of computing 'distance' and then dividing by it, try computing 'recip_distance = rnorm3df(dx, dy, dz)', then multiplying by 'recip_distance'. Obviously you will have to adjust the check against 'CUTOFF', instead checking against 1.0f / CUTOFF.

Calling powf() is likely more expensive than the equivalent straight-line discrete computation:
x2 = x * x; x4 = x2 * x2; x3 = x2 * x; x7 = x4 * x3; x13 = x7 * x3 * x3;
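
A minimal sketch of how the inner-loop body might look with both changes applied (same variable names as the posted code; untested, so treat it as a starting point rather than a drop-in replacement):

float r_inv = rnorm3df((float)dx, (float)dy, (float)dz);   // = 1 / distance
if (r_inv < 1.0f / CUTOFF) {                               // was: distance > CUTOFF
    continue;
}
// ... periodic wrapping of x, y, z as before ...

float sqa  = da  * r_inv;                                  // was: da  / distance
float sqab = dab * r_inv;
float sqb  = db  * r_inv;

// powf(s, 7) and powf(s, 13) replaced by straight-line multiplications
float sqa2 = sqa * sqa, sqa3 = sqa2 * sqa, sqa4 = sqa2 * sqa2;
float sqa7 = sqa4 * sqa3, sqa13 = sqa7 * sqa3 * sqa3;
float faa  = e1 * (2.0f * sqa13 - sqa7);
// ... same pattern for sqab -> fab and sqb -> fbb ...

float na = (a[x][y][z].na * faa + a[x][y][z].nb * fab) * r_inv;  // was: ... / distance
float nb = (a[x][y][z].na * fab + a[x][y][z].nb * fbb) * r_inv;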

It gives me a bit of a speed-up, but not much. As I mentioned, the extreme bottleneck is in the addition-assignment operations. Is there any other way to minimize it?

In general, in CUDA C, this is a bad multidimensional array indexing pattern, assuming your variables idx, idy, and idz follow the usual convention (i.e. idx includes threadIdx.x without any multiplicative scaling, idy includes threadIdx.y, etc.):

a[idx][idy][idz].force_a[0] = 0;

The desired pattern would be:

a[idz][idy][idx].force_a[0] = 0;

(I’m also assuming a is stored in global memory. However, similar rules would likely apply to shared memory, and I’d have to think about the implications carefully if a is actually in the local logical space.)

This allows adjacent threads in a warp to have adjacency of memory access (in C/C++), i.e. improved coalescing behavior.
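
To make that concrete, here is a sketch of the address arithmetic, assuming a is a cubic array declared as point a[DATAZSIZE][DATAYSIZE][DATAXSIZE] in ordinary row-major C layout and idx is built from threadIdx.x:

// Row-major C layout: the rightmost subscript varies fastest in memory, so
//   &a[idz][idy][idx] = base + ((idz*DATAYSIZE + idy)*DATAXSIZE + idx) * sizeof(point)
// Threads of a warp differ mainly in threadIdx.x, i.e. in idx, so with the
// reversed ordering adjacent threads touch adjacent point structs:
a[idz][idy][idx].force_a[0] = 0;   // stride between neighbouring threads: sizeof(point)
a[idx][idy][idz].force_a[0] = 0;   // stride: DATAYSIZE*DATAZSIZE*sizeof(point), far worse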

I’m also pretty sure this would apply to this sort of indexing:

a[x][y][z].na

based on your actual derivation of the x,y,z loop variables.

The negative effect of your current indexing scheme should be visible using the profiler, e.g. looking at metrics gld_efficiency and gst_efficiency, or else just carefully following the guided analysis in the visual profiler.

However, this may be a moot point since you are using an AoS storage pattern. The first step would be to convert to an SoA storage pattern, then reverse the order of your indexing through the various arrays.
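
A minimal sketch of what an SoA layout with x-fastest indexing could look like, assuming the point struct holds na, nb, force_a[3] and force_b[3] as in the posted code (the names below are hypothetical):

// Hypothetical SoA layout: one flat device array per field, addressed through a
// single linear cell index in which x varies fastest (matching threadIdx.x).
struct PointsSoA {
    float *na, *nb;            // the occupancy fields of the old struct
    float *fax, *fay, *faz;    // force_a components
    float *fbx, *fby, *fbz;    // force_b components
};

__host__ __device__ inline unsigned cellIndex(unsigned x, unsigned y, unsigned z)
{
    return (z * DATAYSIZE + y) * DATAXSIZE + x;   // x fastest -> coalesced within a warp
}

// In the kernel, adjacent threadIdx.x then read/write adjacent floats, e.g.:
//   unsigned c = cellIndex(idx, idy, idz);
//   p.fax[c] = fax;  p.fay[c] = fay;  p.faz[c] = faz;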

Hi txbob, thanks for your reply. This is how I do the indexing:

#define DATAXSIZE 65
#define DATAYSIZE 65
#define DATAZSIZE 65
#define BLKXSIZE 9
#define BLKYSIZE 2
#define BLKZSIZE 7
const dim3 blockSize(BLKXSIZE, BLKYSIZE, BLKZSIZE);
const dim3 gridSize(((DATAXSIZE+BLKXSIZE-1)/BLKXSIZE), ((DATAYSIZE+BLKYSIZE-1)/BLKYSIZE), ((DATAZSIZE+BLKZSIZE-1)/BLKZSIZE));
calculateMean<<<gridSize,blockSize>>>(points);

From the calculateMean global function, the calculateForce device function is called. The indices are computed as:

unsigned idx = blockIdx.x*blockDim.x + threadIdx.x;
unsigned idy = blockIdx.y*blockDim.y + threadIdx.y;
unsigned idz = blockIdx.z*blockDim.z + threadIdx.z;

Do you still think it's inefficient?

yep

That's good advice! Thanks!!