You are probably wrong about the cause of your program being slow. How did you determine what makes it slow? It doesn't look like there are any large bank conflicts there.
I found that this program runs at only about 0.3 Gflops.
That's too slow.
So I investigated the cause.
And I found that the cause was the "shared memory access" code.
That's a mistaken conclusion: if you remove the store, the compiler removes the other calculations too, because their results are no longer used. You'd better post the full code of the kernel. And check your measurements: do they include start-up time, copy time, etc.?
My kernel is about 600 lines, so I can't post the full code.
The execution time of the full kernel (about 600 lines) is about 0.6 ms.
But the execution time of a part of the kernel (about 40 lines) is about 0.2 ms.
That part (about 40 lines) executes only variable declarations and the shared memory accesses above.
Thus, I think the cause is the shared memory accesses.
Why should I check the side effects of removing the shared memory accesses? The part of my kernel (about 40 lines) with the shared memory accesses left in is too slow, and that is what I'm wondering about.
The compiler aggressively eliminates unused instructions, so results that are only kept in registers and never read are not computed at all. Results stored to shared memory are computed, though, because they could potentially be used by other threads. This can make a store to shared memory appear slow, while it is actually the calculation feeding it that is taking up the time.
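A minimal sketch of this effect (a hypothetical kernel for illustration, not the poster's actual code): if the accumulated value only ever lives in a register that nothing reads, the compiler is free to delete the whole loop; the shared memory store is what keeps the computation alive.

```cuda
// Hypothetical illustration of dead-code elimination, not the original kernel.
__global__ void dead_code_demo(const float *in, float *out, int n)
{
    __shared__ float s[256];
    int tid = threadIdx.x;
    float acc = 0.0f;

    // Some expensive arithmetic, accumulated into a register.
    for (int i = 0; i < n; ++i)
        acc += in[tid] * in[tid] + (float)i;

    // If you comment out the next line, acc is never used, and the
    // compiler may eliminate the loop above entirely. Timing the kernel
    // "with the store" vs "without the store" then compares the full
    // computation against an almost empty kernel.
    s[tid] = acc;          // the store keeps the computation alive
    __syncthreads();

    out[tid] = s[tid];
}
```

So before blaming the shared memory access itself, make sure the version without it still performs the same arithmetic.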
You commented that the kernel time is 0.6 ms. Do you consider that slow? Compared with which version?
You also commented that your program was about 0.3 Gflops and you considered that slow. Gflops is a different measure of performance: giga floating-point operations per second. 0.3 does look like a low value (what is your device's theoretical peak in Gflops?). With that value I would think your kernel is limited by global memory: the amount of data you move from global memory is unbalanced compared with the amount of arithmetic the kernel does.
Sometimes splitting a big kernel into several smaller ones has advantages (e.g. the resources used per block may be reduced and occupancy may be higher).
PS: This topic has 12 posts. Don't apologize for your English. :)
0.3 Gflops is not an execution time; it's a rate of floating-point operations per second. In your code you are only accumulating an integer value, and I'm not sure at the moment, but I think those operations do not count toward the Gflops. How did you calculate the execution time and the Gflops? Are you using the Compute Visual Profiler?
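One common way to get those numbers outside the profiler (a sketch; the kernel body, launch configuration, and flop count below are placeholders, not the poster's actual code) is to time just the kernel with CUDA events and divide the flop count by the elapsed time:

```cuda
#include <cstdio>

// Placeholder kernel: 2 flops per loop iteration, 1000 iterations per thread.
__global__ void myKernel(float *data)
{
    int i = threadIdx.x;
    float x = data[i];
    for (int k = 0; k < 1000; ++k)
        x = x * 1.000001f + 0.5f;   // one multiply + one add = 2 flops
    data[i] = x;
}

int main()
{
    const int threads = 256;
    float *d_data = nullptr;
    cudaMalloc(&d_data, threads * sizeof(float));
    cudaMemset(d_data, 0, threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Events bracket only the kernel, so start-up and copy time
    // are excluded from the measurement.
    cudaEventRecord(start);
    myKernel<<<1, threads>>>(d_data);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Count the flops yourself from the kernel's arithmetic;
    // integer bookkeeping (loop counters, indices) is not counted.
    double flopsPerThread = 2.0 * 1000.0;
    double gflops = (threads * flopsPerThread) / (ms * 1e-3) / 1e9;
    printf("kernel time = %.3f ms, %.3f Gflops\n", ms, gflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

Note that this counts only floating-point arithmetic; if your kernel mostly accumulates integers, the Gflops figure will naturally be low regardless of how fast the kernel runs.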
PS: Tera, isn't your last code the same as the one Nori posted above? I don't see any difference (I've been working for more than 10 hours today, so maybe I'm missing something :P).