Shared memory accesses are slow...

Hello.

My CUDA program is very slow.

So I investigated, and I found that the cause was the following code:

sA[threadIdx.x].x += dx;
sA[threadIdx.x].y += dy;
sA[threadIdx.x].z += dz;

sA is an array of structures, defined as follows:

struct stA {
    float r;
    float s;
    float t;
    float u;
    float v;
    float w;
    float x;
    float y;
    float z;
};

I thought that the cause was “shared memory bank conflicts” or “occupancy”.

However, according to the CUDA Profiler:

warp_serialize = 0

Occupancy = 0.312

So I think that neither bank conflicts nor occupancy is the cause.

So, what is the cause?

Can you help me?

I’m sorry for my poor English.

Thanks.

You are probably wrong about the cause of your program being slow. How did you find out what makes your program slow? There don't look to be any large bank conflicts there.

Thanks for your answer.

I found that this program achieves only about 0.3 Gflops.
That is far too slow.
So I investigated the cause,
and I found that it was the shared memory accesses shown above.

And how did you find that out?

I removed those shared memory accesses from my program,
and the execution time became much faster.

That is a mistake: in that case the compiler removes the other calculations too, because their results are unused once you remove the store. You had better post the full code of the kernel. And check your measurements, whether you are including start-up time, copy time, etc.

Thanks for your advice.

My kernel is about 600 lines, so I can't post the full code.

The execution time of the full kernel (about 600 lines) is about 0.6 ms, but the execution time of a cut-down version (about 40 lines) is about 0.2 ms.
The 40-line version executes only the variable declarations and the shared memory accesses above.
Thus, I think the cause is the shared memory accesses.

I’m very sorry for my poor English.

Check the side effects of removing the shared memory accesses; there is no way such instructions could slow down so big a kernel by themselves.

Why should I check the side effects of removing the shared memory accesses?
The 40-line part of my kernel, which still contains the shared memory accesses, is also too slow.
That is what puzzles me.

A few instructions cannot slow down a 600-line kernel.

The compiler aggressively eliminates unused instructions. So results that are only stored in registers are never computed. Results stored to shared memory are computed though, because they could potentially be used by other threads. This may make a store to shared memory appear to be slow, while it actually is the calculation that is taking up the time.
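To illustrate with a minimal sketch (a hypothetical kernel, not the code from this thread):

__global__ void deadCodeDemo(const float *in, float *out, int n)
{
    extern __shared__ float sbuf[];

    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += in[i] * in[i];      // the expensive part

    // Comment out this store and the loop above becomes dead code:
    // 'acc' is never used, so the compiler deletes the whole loop and
    // the kernel suddenly looks much "faster".
    sbuf[threadIdx.x] = acc;
    __syncthreads();

    if (threadIdx.x == 0)
        out[blockIdx.x] = sbuf[0];
}

So comparing run times with and without the shared memory store does not isolate the cost of the store itself; it mostly measures whether the computation feeding it survived compilation.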

I see.
However, why is the execution time still so slow?
I can't understand the cause.

I'm sorry that my English is always so poor.

You commented that the kernel time is 0.6 ms. Do you consider that slow? Compared with which version?

You also commented that your program achieved about 0.3 Gflops and you considered that slow. Gflops is a different measure of performance: giga floating-point operations per second. 0.3 does look like a slow value (what is your device's theoretical maximum in Gflops?). With a value like that, I would suspect your kernel is limited by global memory: the amount of data you move through global memory is unbalanced compared with the amount of computation the kernel does.
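To make that concrete (a rough illustrative estimate, not a measurement of this kernel): the Tesla C1060 has about 102 GB/s of theoretical memory bandwidth. A kernel that performs, say, 2 floating-point operations per 4-byte float loaded from global memory can sustain at most roughly 102e9 / 4 * 2 ≈ 51 Gflops, far below the arithmetic peak; with even less computation per byte, the ceiling drops proportionally.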

Sometimes dividing a big kernel into several smaller ones has advantages (e.g. the resources used per block may be reduced, so occupancy may be higher).

PS: This topic has 12 posts. Do not apologize for your English ;).

Best regards.

Post some code here so that we have something to discuss (the mere statement “my kernel is slow” without further information isn’t really helpful).

Please enclose the code between [code]…[/code] tags to make it readable when posting here.


Thanks, everyone.

I use a Tesla C1060 and CUDA 3.1. (The computer is not mine, so I can't use the latest version of CUDA.)

According to NVIDIA, the Tesla C1060 is capable of 933 Gflops of processing performance.
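(If I have the specifications right, that peak comes from 240 stream processors × 1.296 GHz × 3 floating-point operations per clock, a dual-issued multiply-add plus multiply, ≈ 933 Gflops in single precision.)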

So I suspect that my measurement of 0.3 Gflops is wrong.

I'm going to check it again, but it's quite difficult...

Thank you very much!

I’m sorry.

Part of my kernel is:

extern __shared__ stA s[];

unsigned int t1 = threadIdx.x & 31;   // lane index within the warp
unsigned int t2 = threadIdx.x - t1;   // index of the warp's first thread
stA* sA = &s[t2];                     // this warp's slice of shared memory

(snip)

for (unsigned int i = 0; i < 32; i++)
{
  sA[t1].x += dx;
  sA[t1].y += dy;
  sA[t1].z += dz;
  t1 = (t1 + 1) & 31;                 // rotate to the next element
}

Assuming your blockDim.x is a multiple of 32, I could optimize this to

extern __shared__ stA s[];

unsigned int t1 = threadIdx.x & 31;
unsigned int t2 = threadIdx.x - t1;
stA* sA = &s[t2];

(snip)

sA[t1].x += 32*dx;
sA[t1].y += 32*dy;
sA[t1].z += 32*dz;

Is this really what you intended?

Sorry, I posted the wrong code. The right code is:

extern __shared__ stA s[];

unsigned int t1 = threadIdx.x & 31;
unsigned int t2 = threadIdx.x - t1;
stA* sA = &s[t2];

(snip)

for (unsigned int i = 0; i < 32; i++)
{
  (snip: calculating dx, dy and dz)

  sA[t1].x += dx;
  sA[t1].y += dy;
  sA[t1].z += dz;
  t1 = (t1 + 1) & 31;
}

In that case the code should be optimized to

extern __shared__ stA s[];

unsigned int t1 = threadIdx.x & 31;
unsigned int t2 = threadIdx.x - t1;
stA* sA = &s[t2];

(snip)

for (unsigned int i = 0; i < 32; i++)
{
  (snip: calculating dx, dy and dz quickly)

  sA[t1].x += dx;
  sA[t1].y += dy;
  sA[t1].z += dz;
  t1 = (t1 + 1) & 31;
}

0.3 Gflops is not an execution time; it's a rate of floating-point operations per second. In your code, you are only accumulating an integer index; I'm not sure at the moment, but I think those operations do not count toward the GFLOPS. How did you calculate the execution time and the GFLOPS? Are you using the Compute Visual Profiler?
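For what it's worth, here is a minimal sketch of one common way to measure kernel time (and derive a Gflops figure) with CUDA events; the kernel, launch configuration, and flop count below are made-up placeholders, not the kernel from this thread:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *out)
{
    // Placeholder workload: 64 multiply-adds = 128 flops per thread.
    float v = 1.0f;
    for (int i = 0; i < 64; ++i)
        v = v * 1.000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main()
{
    const int blocks = 128, threads = 256;
    const double flopsPerThread = 64 * 2;     // count your kernel's real flops here

    float *d_out;
    cudaMalloc((void**)&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dummyKernel<<<blocks, threads>>>(d_out);  // time only the kernel itself,
    cudaEventRecord(stop, 0);                 // not allocation or copies
    cudaEventSynchronize(stop);               // wait until the kernel finishes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    double gflops = flopsPerThread * blocks * threads / (ms * 1e-3) / 1e9;
    printf("time = %.3f ms, ~%.2f Gflops\n", ms, gflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}

The important points are to time only the kernel launch itself (excluding allocation, copies, and start-up) and to count the kernel's actual floating-point operations when converting time into Gflops.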

PS: tera, isn't your last code the same as the one Nori posted above? I don't see any difference (I've been working for more than 10 hours today, so I must be missing something :P).

Regards!