Low performance when writing data on Global memory in kernel

Hi.
I have a cuda code like this:

__global__ void kernel(double* velocities, double* positions)
{
        int objectId = blockDim.y * blockIdx.x + threadIdx.y;
        int positionId = threadIdx.x;
        int ty = threadIdx.y;
        double force = 0;

        // calculate force
        // (Mass, time and position are presumably defined elsewhere,
        //  e.g. as __constant__ variables or in the omitted force code)

        double accelerate = 0;
        if (Mass != 0)
                accelerate = force / Mass;

        double velocity = velocities[objectId * 8 + positionId];
        velocity = velocity + time * accelerate;

        velocities[objectId * 8 + positionId] = velocity;
        positions[objectId * 8 + positionId] = velocity * time + position;
}

void callKernel(…)
{
        dim3 DimGrid(32,1,1);
        dim3 DimThread(8,32,1);
        kernel<<<DimGrid,DimThread>>>(…);
}

My problem is that when I run this code it takes 8 seconds to complete the task, but when I remove lines 20 and 21 (the two stores to global memory at the end of the kernel) or change the kernel code like this:

__global__ void kernel(double* velocities, double* positions)
{
        int objectId = blockDim.y * blockIdx.x + threadIdx.y;
        int positionId = threadIdx.x;
        int ty = threadIdx.y;
        double force = 0;

        // calculate force

        double accelerate = 0;
        if (Mass != 0)
                accelerate = force / Mass;

        accelerate = 4; // any number (4 is an example)

        double velocity = velocities[objectId * 8 + positionId];
        velocity = velocity + time * accelerate;

        velocities[objectId * 8 + positionId] = velocity;
        positions[objectId * 8 + positionId] = velocity * time + position;
}

it takes less than 1 second to execute. This is very frustrating, and I don’t know how to reduce the running time of the first version. Please help me.

The reason your code runs much faster when you remove lines 20/21 is that you are no longer modifying any externally visible state, so the compiler is free to optimize out most or all of your kernel code. Likewise, when you set accelerate=4, the compiler observes that it no longer needs to perform the previous division operation (which may be computationally expensive).

The first observation sheds very little light on how to make your kernel run faster. The second observation may offer an insight, however: floating-point division can be a computationally expensive operation.

It’s not entirely clear where the Mass variable is coming from in the kernel code you have shown. However, since presumably the mass(es) of your particles do not change, you may wish to precompute (i.e., before running your kernel) the reciprocal of mass (let’s call it RMass = 1/Mass), store those values, and then perform a floating-point multiply in your kernel code instead:

accelerate = force * RMass;

Note that even this idea may be misguided, however. If the code you haven’t shown but marked as //calculate force is expensive, then hardcoding accelerate to 4 still allows the compiler to eliminate all of that force-calculation code, and that may well be the reason for the dramatic speedup, if that is the heavy lifting in your kernel (and it may well be, if you are computing and summing multi-particle interactions to obtain the net force). Therefore, I’m not optimistic that this suggestion will yield much benefit.

The net of this is that this sort of analysis is not that useful. You’ll need to do profiling and analysis-driven optimization, or some other approach, to discover useful insights about optimizing your kernel.
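As a starting point (assuming a recent CUDA toolkit is installed; `./myapp` stands in for your executable):

```shell
# Timeline view: how long each kernel launch and memcpy actually takes
nsys profile -o report ./myapp

# Per-kernel hardware metrics: memory throughput, occupancy, stall reasons
ncu ./myapp
```

The Nsight Compute output in particular will tell you whether the kernel is bound by the force calculation (compute) or by the global-memory traffic.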