Shared memory accesses are slow...

Hello.

My CUDA program is very slow.

So I investigated, and I found that the cause was the following code:

sA[threadIdx.x].x += dx;
sA[threadIdx.x].y += dy;
sA[threadIdx.x].z += dz;

sA is an array of structures, defined as follows:

struct stA {
    float r;
    float s;
    float t;
    float u;
    float v;
    float w;
    float x;
    float y;
    float z;
};

I thought that the cause was “shared memory bank conflicts” or “occupancy”.

However, according to the CUDA Profiler:

warp_serialize = 0

Occupancy = 0.312

So I think that neither bank conflicts nor occupancy is the cause.

So, what is the cause?

Can you help me?

I’m sorry for my poor English.

Thanks.

You are probably wrong about the cause of your program being slow. How did you find out what makes your program slow? There don't look to be any large bank conflicts there.

Thanks for your answer.

I found that this program achieves only about 0.3 Gflops.
That is far too slow.
So I investigated the cause,
and I found that it was the shared memory accesses shown above.

And how did you find that out?

I removed those shared memory accesses from my program,
and the execution time became much faster.

That is a mistake: in that case the compiler removes the other calculations too, because their results are unused once you remove the store. You had better post the full code of the kernel. And check your measurements, whether you are including start-up time, copy time, etc.

Thanks for your advice.

My kernel is about 600 lines, so I can't post the full code.

The execution time of the full kernel (about 600 lines) is about 0.6 ms, but the execution time of a cut-down version (about 40 lines) is about 0.2 ms.
The 40-line version executes only the variable declarations and the shared memory accesses above.
Thus, I think the cause is the shared memory accesses.

I’m very sorry for my poor English.

Check the side effects of removing the shared memory accesses; there is no way such instructions could slow down so big a kernel by themselves.

Why should I check the side effects of removing the shared memory accesses?
The 40-line part of my kernel, which still contains the shared memory accesses, is also too slow.
That is what puzzles me.

A few instructions cannot slow down a 600-line kernel.

The compiler aggressively eliminates unused instructions. So results that are only stored in registers are never computed. Results stored to shared memory are computed though, because they could potentially be used by other threads. This may make a store to shared memory appear to be slow, while it actually is the calculation that is taking up the time.
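To illustrate with a minimal sketch (a hypothetical kernel, not the code from this thread):

__global__ void deadCodeDemo(const float *in, float *out, int n)
{
    extern __shared__ float sbuf[];

    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += in[i] * in[i];      // the expensive part

    // Comment out this store and the loop above becomes dead code:
    // 'acc' is never used, so the compiler deletes the whole loop and
    // the kernel suddenly looks much "faster".
    sbuf[threadIdx.x] = acc;
    __syncthreads();

    if (threadIdx.x == 0)
        out[blockIdx.x] = sbuf[0];
}

So comparing run times with and without the shared memory store does not isolate the cost of the store itself; it mostly measures whether the computation feeding it survived compilation.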

I see.
However, why is the execution time still so slow?
I can't understand the cause.

I'm sorry that my English is always so poor.

You commented that the kernel time is 0.6 ms. Do you consider that slow? Compared with which version?

You also commented that your program achieved about 0.3 Gflops and you considered that slow. Gflops is a different measure of performance: giga floating-point operations per second. 0.3 does look like a slow value (what is your device's theoretical maximum in Gflops?). With a value like that, I would suspect your kernel is limited by global memory: the amount of data you move through global memory is unbalanced compared with the amount of computation the kernel does.
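To make that concrete (a rough illustrative estimate, not a measurement of this kernel): the Tesla C1060 has about 102 GB/s of theoretical memory bandwidth. A kernel that performs, say, 2 floating-point operations per 4-byte float loaded from global memory can sustain at most roughly 102e9 / 4 * 2 ≈ 51 Gflops, far below the arithmetic peak; with even less computation per byte, the ceiling drops proportionally.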

Sometimes dividing a big kernel into several smaller ones has advantages (e.g. the resources used per block may be reduced, so occupancy may be higher).

PS: This topic has 12 posts. Do not apologize for your English ;).

Best regards.

Post some code here so that we have something to discuss (the mere statement “my kernel is slow” without further information isn’t really helpful).

Please enclose the code between [code]…[/code] tags to make it readable when posting here.


Thanks, everyone.

I use a Tesla C1060 and CUDA 3.1. (The computer is not mine, so I can't use the latest version of CUDA.)

According to NVIDIA, the Tesla C1060 is capable of 933 Gflops of processing performance.
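(If I have the specifications right, that peak comes from 240 stream processors × 1.296 GHz × 3 floating-point operations per clock, a dual-issued multiply-add plus multiply, ≈ 933 Gflops in single precision.)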

So I suspect that my measurement of 0.3 Gflops is wrong.

I'm going to check it again, but it's quite difficult...

Thank you very much!

I’m sorry.

Part of my kernel is:

extern __shared__ stA s[];

unsigned int t1 = threadIdx.x & 31;   // lane index within the warp
unsigned int t2 = threadIdx.x - t1;   // index of the warp's first thread
stA* sA = &s[t2];                     // this warp's slice of shared memory

(snip)

for (unsigned int i = 0; i < 32; i++)
{
  sA[t1].x += dx;
  sA[t1].y += dy;
  sA[t1].z += dz;
  t1 = (t1 + 1) & 31;                 // rotate to the next element
}

Assuming your blockDim.x is a multiple of 32, I could optimize this to

extern __shared__ stA s[];

unsigned int t1 = threadIdx.x & 31;
unsigned int t2 = threadIdx.x - t1;
stA* sA = &s[t2];

(snip)

sA[t1].x += 32*dx;
sA[t1].y += 32*dy;
sA[t1].z += 32*dz;

Is this really what you intended?

Sorry, I posted the wrong code. The right code is:

extern __shared__ stA s[];

unsigned int t1 = threadIdx.x & 31;
unsigned int t2 = threadIdx.x - t1;
stA* sA = &s[t2];

(snip)

for (unsigned int i = 0; i < 32; i++)
{
  (snip: calculating dx, dy and dz)

  sA[t1].x += dx;
  sA[t1].y += dy;
  sA[t1].z += dz;
  t1 = (t1 + 1) & 31;
}

In that case the code should be optimized to

extern __shared__ stA s[];

unsigned int t1 = threadIdx.x & 31;
unsigned int t2 = threadIdx.x - t1;
stA* sA = &s[t2];

(snip)

for (unsigned int i = 0; i < 32; i++)
{
  (snip: calculating dx, dy and dz quickly)

  sA[t1].x += dx;
  sA[t1].y += dy;
  sA[t1].z += dz;
  t1 = (t1 + 1) & 31;
}

0.3 Gflops is not an execution time; it's a rate of floating-point operations per second. In your code, you are only accumulating an integer index; I'm not sure at the moment, but I think those operations do not count toward the GFLOPS. How did you calculate the execution time and the GFLOPS? Are you using the Compute Visual Profiler?
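For what it's worth, here is a minimal sketch of one common way to measure kernel time (and derive a Gflops figure) with CUDA events; the kernel, launch configuration, and flop count below are made-up placeholders, not the kernel from this thread:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *out)
{
    // Placeholder workload: 64 multiply-adds = 128 flops per thread.
    float v = 1.0f;
    for (int i = 0; i < 64; ++i)
        v = v * 1.000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main()
{
    const int blocks = 128, threads = 256;
    const double flopsPerThread = 64 * 2;     // count your kernel's real flops here

    float *d_out;
    cudaMalloc((void**)&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dummyKernel<<<blocks, threads>>>(d_out);  // time only the kernel itself,
    cudaEventRecord(stop, 0);                 // not allocation or copies
    cudaEventSynchronize(stop);               // wait until the kernel finishes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    double gflops = flopsPerThread * blocks * threads / (ms * 1e-3) / 1e9;
    printf("time = %.3f ms, ~%.2f Gflops\n", ms, gflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}

The important points are to time only the kernel launch itself (excluding allocation, copies, and start-up) and to count the kernel's actual floating-point operations when converting time into Gflops.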

PS: tera, isn't your last code the same as the one Nori posted above? I don't see any difference (I've been working for more than 10 hours today, so I must be missing something :P).

Regards!