Shared Memory Bandwidth

ndzuser · August 2, 2013, 8:07pm

I have a kernel that reads data from global memory into shared memory, does some calculations, then writes the results to global memory. The global memory accesses are coalesced, as the profiler confirms, and there are no bank conflicts. The occupancy is 25%, as each block uses about 2 KB of shared memory and I am using 32 threads per block. The active threads are 15 but the eligible threads are 1.5. There is some branching in the code, but the application requires it. The shared memory statistics show that the SM-to-shared bandwidth used is about 210 GB/s. IPC issued and executed are very close (1.25), and instruction serialization is 9%.

Is there any room to improve the performance based on the above numbers? Is shared memory bandwidth the bottleneck? Or do instruction dependencies keep the number of eligible threads low (and is there a way to tell)?

I have attached the exported profiler results and the device summary.
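For reference, the 25% occupancy figure can be reproduced with a back-of-the-envelope calculation. The sketch below is a simplification, not the profiler's actual algorithm: it ignores register usage and allocation granularity, and the per-SM limits (64 resident warps, 16 resident blocks, 48 KB of shared memory) are assumptions for a compute capability 3.5 device such as the K20c identified later in the thread. The names `SmLimits`, `resident_blocks`, and `occupancy` are made up for illustration.

```python
from dataclasses import dataclass

WARP_SIZE = 32


@dataclass
class SmLimits:
    """Assumed per-SM limits for a compute capability 3.5 device."""
    max_warps: int = 64             # resident warps per SM
    max_blocks: int = 16            # resident blocks per SM
    shared_bytes: int = 48 * 1024   # shared memory per SM (largest configuration)


def resident_blocks(threads_per_block, shared_per_block, limits):
    """Blocks that fit on one SM, ignoring registers and allocation granularity."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    by_warps = limits.max_warps // warps_per_block
    if shared_per_block > 0:
        by_shared = limits.shared_bytes // shared_per_block
    else:
        by_shared = limits.max_blocks
    return min(limits.max_blocks, by_warps, by_shared)


def occupancy(threads_per_block, shared_per_block, limits=SmLimits()):
    """Resident warps divided by the maximum number of resident warps."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    blocks = resident_blocks(threads_per_block, shared_per_block, limits)
    return blocks * warps_per_block / limits.max_warps


# Configuration from the post: 32 threads and about 2 KB of shared memory per block.
print(occupancy(32, 2 * 1024))    # 0.25, limited by the 16-block cap
# Hypothetical alternative: 128 threads per block with shared memory scaled 4x.
print(occupancy(128, 8 * 1024))   # 0.375, now limited by shared memory
```

Under these assumptions, the limiter for 32-thread blocks is the cap on resident blocks rather than shared memory: 16 blocks of one warp each give 16 of 64 warps. That is consistent with the reported figure of about 15 active, if that metric counts warps.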

njuffa · August 3, 2013, 12:28am

You might want to state which GPU is being used here, so the performance numbers can be put into perspective.

ndzuser · August 3, 2013, 2:23am

It is in the attachment. It’s a K20c. I am not sure whether the attachment is visible; it is constantly “scanning”.

njuffa · August 3, 2013, 3:17am

I do not see an attachment. I have a K20c here and, as one piece of anecdotal data, I am currently running a dense solver that uses many shared memory operations, for which the profiler reports 592 GB/s of shared memory loads and 301 GB/s of shared memory stores. Compared with your numbers, this suggests that your code is not limited by shared memory bandwidth.
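As a rough illustration of that comparison, the figures can be put side by side. This sketch assumes the 210 GB/s figure counts loads and stores together, which the thread does not confirm, and it treats the dense solver's numbers as an observed reference point rather than a hardware peak. The function name `fraction_of_reference` is made up for illustration.

```python
def fraction_of_reference(measured_gbs, reference_gbs):
    """Measured shared memory traffic as a fraction of a reference figure."""
    return measured_gbs / reference_gbs


measured = 210.0        # GB/s, from the original question
ref_loads = 592.0       # GB/s, dense solver shared memory loads
ref_stores = 301.0      # GB/s, dense solver shared memory stores

# Against the combined load and store traffic of the reference workload.
print(round(fraction_of_reference(measured, ref_loads + ref_stores), 3))
# Against the load traffic alone, as a more conservative comparison.
print(round(fraction_of_reference(measured, ref_loads), 3))
```

Either way, the kernel in question moves well under half the shared memory traffic that another workload sustains on the same GPU model.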

Without seeing either code or profiler output, my best guess is that you would want to increase the number of threads actively doing work. You may also want to look into the global memory bandwidth needs of your code. If there isn’t much processing for each byte read, that could be limiting performance. Consider that with ECC enabled the K20c provides about 150 GB/s of global memory throughput when 64-bit accesses are used, while being able to reach about 1100 DP GFLOPS in DGEMM.
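To make the "processing per byte" point concrete, here is a minimal roofline-style sketch using the two figures quoted above (about 150 GB/s and about 1100 DP GFLOPS). It is a simplified bound, not a performance prediction, and it assumes double-precision arithmetic, which the original question does not state. The function names and the example intensities are hypothetical.

```python
def machine_balance(peak_gflops, bandwidth_gbs):
    """FLOPs per byte of global memory traffic needed to become compute bound."""
    return peak_gflops / bandwidth_gbs


def attainable_gflops(flops_per_byte, peak_gflops=1100.0, bandwidth_gbs=150.0):
    """Upper bound on throughput for a kernel with the given arithmetic intensity."""
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)


# Break-even arithmetic intensity for the quoted K20c figures.
print(round(machine_balance(1100.0, 150.0), 2))   # 7.33 FLOP/byte

# Hypothetical kernels: one FLOP per byte versus ten FLOPs per byte.
print(attainable_gflops(1.0))    # 150.0, bandwidth bound
print(attainable_gflops(10.0))   # 1100.0, compute bound
```

With these numbers, a kernel performing fewer than roughly 7 double-precision operations per byte of global memory traffic would be bounded by memory bandwidth rather than by arithmetic throughput.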

You may want to give the guided optimization feature of the CUDA 5.5 profiler a try:
https://developer.nvidia.com/nvidia-visual-profiler
