Optimising CUDA code for Lennard-Jones potential calculation

I have a grid of 30x30x30 periodic cells for which I need to calculate the Lennard-Jones force using CUDA. A global function calls this device function, supplying it with the array of per-point data structures and the (x, y, z) indices of the point to process.

// Accumulates the Lennard-Jones forces acting on cell (idx, idy, idz) from all
// neighbouring cells within the cutoff radius, with periodic boundary wrapping.
__device__ void calculateForce(point a[][DATAYSIZE][DATAZSIZE], int idx, int idy, int idz){
    // Clear the force accumulators for both particle types
    a[idx][idy][idz].force_a[0] = 0;
    a[idx][idy][idz].force_a[1] = 0;
    a[idx][idy][idz].force_a[2] = 0;

    a[idx][idy][idz].force_b[0] = 0;
    a[idx][idy][idz].force_b[1] = 0;
    a[idx][idy][idz].force_b[2] = 0;

    int dx, dy, dz;
    float fax = 0, fay = 0, faz = 0;
    float fbx = 0, fby = 0, fbz = 0;

    float e1 =  24 * epsilon_a/da;
    float e2 =  24 * eab/dab;
    float e3 =  24 * epsilon_b/db;
    // Visit every cell in the (2*CUTOFF)^3 cube centred on (idx, idy, idz)
    for (int i = idx-CUTOFF; i < idx+CUTOFF; i++){
        for(int j = idy-CUTOFF; j < idy+CUTOFF; j++){
            for(int k = idz-CUTOFF; k < idz+CUTOFF; k++){
                dx = idx - i;
                dy = idy - j;
                dz = idz - k;
                float distance = sqrtf(dx*dx+dy*dy+dz*dz);
                // Skip neighbours outside the spherical cutoff
                if(distance > CUTOFF){
                    continue;
                }
                // Wrap the neighbour indices back into the periodic box
                int x=i,y=j,z=k;
                if(x>DATAXSIZE-1){
                    x -= DATAXSIZE;
                }else if(x<0){
                    x += DATAXSIZE;
                }
                if(y>DATAYSIZE-1){
                    y -= DATAYSIZE;
                }else if(y<0){
                    y += DATAYSIZE;
                }
                if(z>DATAZSIZE-1){
                    z -= DATAZSIZE;
                }else if(z<0){
                    z += DATAZSIZE;
                }

                float sqa = da/distance;
                float sqab = dab/distance;
                float sqb = db/distance;

                // Pair force magnitudes: (24*eps/sigma) * (2*(sigma/r)^13 - (sigma/r)^7)
                float faa = e1 * (2 * powf(sqa, 13) - powf(sqa, 7));
                float fab = e2 *(2 * powf(sqab, 13) - powf(sqab, 7)) ;
                float fbb = e3 * (2 * powf(sqb, 13) - powf(sqb, 7));

                float na = (a[x][y][z].na * faa + a[x][y][z].nb * fab)/distance;
                float nb = (a[x][y][z].na * fab + a[x][y][z].nb * fbb)/distance;

                fax += na * dx;
                fay += na * dy;
                faz += na * dz;

                fbx += nb * dx;
                fby += nb * dy;
                fbz += nb * dz;
            }
        }
    }
    // Write back the accumulated force components for this cell
    a[idx][idy][idz].force_a[0] = fax;
    a[idx][idy][idz].force_a[1] = fay;
    a[idx][idy][idz].force_a[2] = faz;

    a[idx][idy][idz].force_b[0] = fbx;
    a[idx][idy][idz].force_b[1] = fby;
    a[idx][idy][idz].force_b[2] = fbz;
}

For a cube of side length (2 x CUTOFF), I calculate the distance of each cell from the centre of the cube. If the distance is less than the CUTOFF radius, I calculate the Lennard-Jones force. I have already minimised repetitive interactions in this code.

But here, the addition-assignment operations at the end of the inner loop (the fax/fay/faz and fbx/fby/fbz accumulations) seem to be an extreme bottleneck. Is it possible to optimise this further?

Instead of computing 'distance' and then dividing by it, try computing 'recip_distance = rnorm3df(dx, dy, dz)', then multiplying by 'recip_distance'. Obviously you will have to adjust the check against 'CUTOFF', instead checking against 1.0f / CUTOFF.

Calling powf() is likely more expensive than the equivalent straight-line discrete computation:
x2 = x * x; x4 = x2 * x2; x3 = x2 * x; x7 = x4 * x3; x13 = x7 * x3 * x3;
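
A minimal sketch of how the inner-loop body might look with both changes applied (same variable names as the posted code; untested, so treat it as a starting point rather than a drop-in replacement):

float r_inv = rnorm3df((float)dx, (float)dy, (float)dz);   // = 1 / distance
if (r_inv < 1.0f / CUTOFF) {                               // was: distance > CUTOFF
    continue;
}
// ... periodic wrapping of x, y, z as before ...

float sqa  = da  * r_inv;                                  // was: da  / distance
float sqab = dab * r_inv;
float sqb  = db  * r_inv;

// powf(s, 7) and powf(s, 13) replaced by straight-line multiplications
float sqa2 = sqa * sqa, sqa3 = sqa2 * sqa, sqa4 = sqa2 * sqa2;
float sqa7 = sqa4 * sqa3, sqa13 = sqa7 * sqa3 * sqa3;
float faa  = e1 * (2.0f * sqa13 - sqa7);
// ... same pattern for sqab -> fab and sqb -> fbb ...

float na = (a[x][y][z].na * faa + a[x][y][z].nb * fab) * r_inv;  // was: ... / distance
float nb = (a[x][y][z].na * fab + a[x][y][z].nb * fbb) * r_inv;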

It gives me a bit of a speed-up, but not much. As I mentioned, the extreme bottleneck is in the addition-assignment operations. Is there any other way to minimize it?

In general, in CUDA C, this is a bad multidimensional array indexing pattern, assuming your variables idx, idy, and idz follow the usual convention (i.e. idx includes threadIdx.x without any multiplicative scaling, idy includes threadIdx.y, etc.):

a[idx][idy][idz].force_a[0] = 0;

The desired pattern would be:

a[idz][idy][idx].force_a[0] = 0;

(I’m also assuming a is stored in global memory. However, similar rules would likely apply to shared memory, and I’d have to think about the implications carefully if a is actually in the local logical space.)

This allows adjacent threads in a warp to have adjacency of memory access (in C/C++), i.e. improved coalescing behavior.
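
To make that concrete, here is a sketch of the address arithmetic, assuming a is a cubic array declared as point a[DATAZSIZE][DATAYSIZE][DATAXSIZE] in ordinary row-major C layout and idx is built from threadIdx.x:

// Row-major C layout: the rightmost subscript varies fastest in memory, so
//   &a[idz][idy][idx] = base + ((idz*DATAYSIZE + idy)*DATAXSIZE + idx) * sizeof(point)
// Threads of a warp differ mainly in threadIdx.x, i.e. in idx, so with the
// reversed ordering adjacent threads touch adjacent point structs:
a[idz][idy][idx].force_a[0] = 0;   // stride between neighbouring threads: sizeof(point)
a[idx][idy][idz].force_a[0] = 0;   // stride: DATAYSIZE*DATAZSIZE*sizeof(point), far worse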

I’m also pretty sure this would apply to this sort of indexing:

a[x][y][z].na

based on your actual derivation of the x,y,z loop variables.

The negative effect of your current indexing scheme should be visible using the profiler, e.g. looking at metrics gld_efficiency and gst_efficiency, or else just carefully following the guided analysis in the visual profiler.

However, this may be a moot point since you are using an AoS storage pattern. The first step would be to convert to an SoA storage pattern, then reverse the order of your indexing through the various arrays.
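
A minimal sketch of what an SoA layout with x-fastest indexing could look like, assuming the point struct holds na, nb, force_a[3] and force_b[3] as in the posted code (the names below are hypothetical):

// Hypothetical SoA layout: one flat device array per field, addressed through a
// single linear cell index in which x varies fastest (matching threadIdx.x).
struct PointsSoA {
    float *na, *nb;            // the occupancy fields of the old struct
    float *fax, *fay, *faz;    // force_a components
    float *fbx, *fby, *fbz;    // force_b components
};

__host__ __device__ inline unsigned cellIndex(unsigned x, unsigned y, unsigned z)
{
    return (z * DATAYSIZE + y) * DATAXSIZE + x;   // x fastest -> coalesced within a warp
}

// In the kernel, adjacent threadIdx.x then read/write adjacent floats, e.g.:
//   unsigned c = cellIndex(idx, idy, idz);
//   p.fax[c] = fax;  p.fay[c] = fay;  p.faz[c] = faz;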

Hi txbob, thanks for your reply. This is how I do the indexing:

#define DATAXSIZE 65
#define DATAYSIZE 65
#define DATAZSIZE 65
#define BLKXSIZE 9
#define BLKYSIZE 2
#define BLKZSIZE 7
const dim3 blockSize(BLKXSIZE, BLKYSIZE, BLKZSIZE);
const dim3 gridSize(((DATAXSIZE+BLKXSIZE-1)/BLKXSIZE), ((DATAYSIZE+BLKYSIZE-1)/BLKYSIZE), ((DATAZSIZE+BLKZSIZE-1)/BLKZSIZE));
calculateMean<<<gridSize,blockSize>>>(points);

From the calculateMean global function, the calculateForce device function is called. The indices are computed as:

unsigned idx = blockIdx.x*blockDim.x + threadIdx.x;
unsigned idy = blockIdx.y*blockDim.y + threadIdx.y;
unsigned idz = blockIdx.z*blockDim.z + threadIdx.z;

Do you still think it's inefficient?

yep

That's good advice! Thanks!!