Shared Memory Bandwidth

I have a kernel that reads data from global memory into shared memory, does some calculation, then writes the result back to global memory. The memory accesses are coalesced, as the profiler shows, and there are no bank conflicts. Occupancy is 25%, since each block uses about 2 KB of shared memory and I am using 32 threads per block. Active warps are 15, but eligible warps are only 1.5. There is some branching, but the application requires it. The shared memory stats show SM-to-shared bandwidth of about 210 GB/s; IPC issued and executed are very close (1.25), and instruction serialization is 9%.
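For reference, here is a minimal sketch of the access pattern I am describing; the kernel name, tile size, and the "calculation" are placeholders, not my actual code:

```cuda
// Hypothetical reconstruction of the pattern: coalesced global load into a
// ~2 KB shared tile, per-thread work, coalesced store back to global memory.
// stageAndProcess, in, out, and the computation are stand-ins.
__global__ void stageAndProcess(const double *in, double *out, int n)
{
    __shared__ double tile[256];            // 256 * 8 B = 2 KB per block

    int base = blockIdx.x * 256;

    // 32 threads cooperatively stage 256 elements (coalesced loads)
    for (int i = threadIdx.x; i < 256 && base + i < n; i += blockDim.x)
        tile[i] = in[base + i];
    __syncthreads();

    // stand-in computation; the real kernel does application-specific
    // work here, including the required branching
    for (int i = threadIdx.x; i < 256 && base + i < n; i += blockDim.x)
        out[base + i] = tile[i] * 2.0 + tile[(i + 1) % 256];
}
```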

Is there any room to improve performance based on the above numbers? Is the bottleneck shared memory bandwidth? Or is instruction dependency keeping the eligible warps low (is there a way to tell)?

Attached are the exported profiler results and the device summary.

You might want to state which GPU is being used here, so the performance numbers can be put into perspective.

It is in the attachment; it's a K20c. Not sure if the attachment can be seen, though: it's constantly "scanning".

I do not see an attachment. I have a K20c here, and as one piece of anecdotal data, I am currently running a dense solver that uses many shared memory operations, for which the profiler reports 592 GB/s of shared memory loads and 301 GB/s of shared memory stores. Compared with these, your numbers suggest that your code is not limited by shared memory bandwidth.

Without seeing either code or profiler output, my best guess is that you would want to increase the number of threads actively doing work. You may also want to look into the global memory bandwidth needs of your code; if there isn't much processing for each byte read, that could be limiting performance. Consider that with ECC enabled, the K20c provides about 150 GB/s of throughput when 64-bit accesses are used, while being able to crank up to about 1100 DP GFLOPS in DGEMM.

You may want to give the guided optimization feature of the CUDA 5.5 profiler a try: