shared memory accesses are slow...

Yes, the visible code is the same. The difference is in the optimized calculation of dx, dy, and dz.

Wow! I didn't notice that! It's probably a good optimization.

I’m sorry.

I wrote "Gflops" loosely because I can't explain it well in English.

About 2.0 * 10^6 instructions take about 6.0 * 10^3 seconds.

I use the profiler from the command line.

I want to optimize my kernel with that method!

However, the bottleneck of my kernel seems to be shared memory accesses.

No. I’ve already explained that the time is spent in the calculation of dx, dy, and dz. Storing them to shared memory doesn’t make the calculation slow. But if you don’t store them, the compiler will eliminate their computation as well.
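(As a side note on that dead-code point, a minimal sketch with hypothetical names, not the original kernel: if the computed values are never written anywhere, nvcc is free to delete the arithmetic, so a "timing" of the math alone measures nothing.)

// Hypothetical illustration: if the result is never stored, the compiler
// can eliminate the whole dx/dy/dz computation.
__global__ void distance_kernel(const float4 *p, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float dx = p[i].x - p[0].x;
    float dy = p[i].y - p[0].y;
    float dz = p[i].z - p[0].z;
    float r2 = dx * dx + dy * dy + dz * dz;

    out[i] = r2;   // remove this store and the math above becomes dead code
}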

The struct

struct stA {
  float r;
  float s;
  float t;
  float u;
  float v;
  float w;
  float x;
  float y;
  float z;
};

has nine 32-bit elements, so the (data, bank) mapping is probably misaligned. If you want to check for bank conflicts, I recommend using nine separate shared memory arrays instead of one array of structs.

extern __shared__ float sr[];
extern __shared__ float ss[];
extern __shared__ float st[];
...
extern __shared__ float sz[];
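(Note that in CUDA every extern __shared__ declaration aliases the start of the same dynamic allocation, so in practice the nine arrays have to be carved out of a single allocation. A minimal sketch, where the per-array length TILE and the pointer names are assumptions:)

__global__ void kernel(/* ... */)
{
    extern __shared__ float smem[];
    const int TILE = blockDim.x;          // assumed: one slot per thread
    float *sr = smem + 0 * TILE;
    float *ss = smem + 1 * TILE;
    float *st = smem + 2 * TILE;
    /* ...six more arrays... */
    float *sz = smem + 8 * TILE;
    // accesses then become sr[t1], ss[t1], ... instead of s[t1].r, s[t1].s, ...
}
// launched with 9 * threadsPerBlock * sizeof(float) bytes of dynamic shared memory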

Anyway, as tera commented, important parts of your code are missing, for example the size of your problem and the size of your grid and blocks…

I see.

So the cause of the long execution time is not that the calculations are slow, but the memory accesses?

I tried it.

The occupancy is over 0.9.

But the execution time remains the same.

I’m sorry.

The number of blocks is 30.

The number of threads is 320.

My kernel is big.

So I can't post all of the code.

A part of the code is:

extern __shared__ stA s[];

unsigned int t1 = threadIdx.x & 31;   // lane index within the warp
unsigned int t2 = threadIdx.x - t1;   // index of the first thread of this warp

stA* sA = &s[t2];                     // this warp's 32-entry tile of the shared array

(snip)

for (unsigned int i = 0; i < 32; i++)
{
  dx        = sA[i].x - pos.x;
  dy        = sA[i].y - pos.y;
  dz        = sA[i].z - pos.z;
  r2        = dx * dx + dy * dy + dz * dz;
  invR      = 1.0f / sqrt(r2);
  sig       = a.x + sA[i].r;
  sig2      = invR * sig;
  sig2     *= sig2;
  sig6      = sig2 * sig2 * sig2;
  eps       = a.y * sA[i].t;
  dEdR      = eps * (12.0f * sig6 - 6.0f) * sig6;
  C_energy  = eps * (sig6 - 1.0f) * sig6;
  dEdR     += pos.w * sA[i].s * invR;
  C_energy += pos.w * sA[i].s * invR;
  dEdR     *= invR * invR;
  energy   += C_energy;
  dx       *= dEdR;
  dy       *= dEdR;
  dz       *= dEdR;
  af.x     -= dx;
  af.y     -= dy;
  af.z     -= dz;
  sA[t1].x += dx;
  sA[t1].y += dy;
  sA[t1].z += dz;
  t1        = (t1 + 1) & 31;          // rotate the write index through the tile
}

pos is a float4. The other variables are float.

No.

The cause is probably that the calculation is slow (possibly including the memory reads), not the shared memory accesses that store the results.

Where does that occupancy number come from? On the previous page of this thread you wrote the occupancy is 0.312. So did the shared memory layout change increase the measured occupancy from 0.312 to 0.9?

Is 320 the number of threads per block? If it were the total number of threads, it would be far too low to fill the device.
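(For reference, a minimal sketch of such a launch, assuming dynamic shared memory sized to one stA per thread; the kernel name and arguments are placeholders:)

// Hypothetical launch: 30 blocks of 320 threads (9600 threads in total),
// with one 36-byte stA entry of dynamic shared memory per thread.
dim3 grid(30);
dim3 block(320);
size_t smemBytes = block.x * sizeof(stA);   // 320 * 36 = 11520 bytes per block
myKernel<<<grid, block, smemBytes>>>(/* arguments */);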

Change

invR            = 1.0f / sqrt(r2);

to

invR            = rsqrtf(r2);

This removes a lot of unnecessary code internally. And as you were inadvertently using double precision here, this additional code was very slow.
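(For comparison, the variants side by side; the comments reflect the point made above:)

float invR;
invR = 1.0f / sqrt(r2);    // original: sqrt may be evaluated in double precision, plus a division
invR = 1.0f / sqrtf(r2);   // single-precision sqrt, but still followed by a division
invR = rsqrtf(r2);         // single-precision reciprocal square root, the fast path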

What is the for-loop and this magic

t1              = (t1 + 1) & 31;

for? Have you again dropped some code from the for-loop without marking where it is missing?

Otherwise it seems to me like you misunderstood the CUDA execution model. Is the code producing correct results? I presume you have a CPU-only version that you are comparing against.

I changed my code.

And I re-measured the occupancy using the CUDA Profiler.

I’m sorry.

It means the number of threads per block.

I tried it.

Now the execution time is faster!

I’m amazed!

No. This kernel is producing correct results.

The result is the same as with the CPU-only code.