I am new to CUDA programming and am trying to optimize a stencil kernel that makes heavy use of shared memory, but I have hit an “invisible” roofline. I removed all global memory operations from the code to confirm that shared memory access is the cause, but I still can’t tell from the Nsight Compute report where the bottleneck is.
According to the report, LSU utilization is close to 100%, but shared memory utilization is only 48%, and there are no bank conflicts. Does this mean that the pipeline cannot execute any more read and write instructions, or do I just have a bad shared memory access pattern? Note that I am using half precision for the shared memory array.
When I switch to single precision, shared memory utilization increases to 96% (LSU utilization stays close to 100%), but every read has a 2-way bank conflict. This is probably because I have an array of structures, and each structure consists of 2 numbers (which, in half precision, fit in a single 4-byte word of one bank). Occupancy also decreases. Nevertheless, the kernel’s execution time becomes slightly shorter.
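To illustrate my guess about the bank conflicts, here is a minimal host-side sketch of the bank mapping. It is not my kernel code. It assumes 32 banks that are each 4 bytes wide, a warp of 32 threads, thread `t` accessing structure `t`, and each field being read with its own load of at most 4 bytes. The function name `conflict_degree` and its parameters are hypothetical, introduced only for this illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <set>

// Assumed hardware parameters (not taken from the profiler report).
constexpr int kBanks = 32;          // number of shared memory banks
constexpr int kBankWidthBytes = 4;  // width of one bank word
constexpr int kWarpSize = 32;       // threads per warp

// Simplified model: thread t reads one field located at byte address
// t * stride_bytes + offset_bytes. Returns the largest number of distinct
// bank words that fall into the same bank, i.e. the conflict degree
// (1 means conflict-free, 2 means a 2-way conflict).
int conflict_degree(int stride_bytes, int offset_bytes) {
    assert(stride_bytes > 0 && offset_bytes >= 0);
    std::array<std::set<int>, kBanks> words_per_bank;
    for (int t = 0; t < kWarpSize; ++t) {
        int byte_addr = t * stride_bytes + offset_bytes;
        int word = byte_addr / kBankWidthBytes;      // index of the 4-byte word
        words_per_bank[word % kBanks].insert(word);  // the same word is not a conflict
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}
```

Under these assumptions, `conflict_degree(4, 0)` and `conflict_degree(4, 2)` (two halves per structure) both give 1, while `conflict_degree(8, 0)` and `conflict_degree(8, 4)` (two floats per structure) both give 2, which matches the 2-way conflicts in the report.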
This behavior seems rather strange to me. Perhaps I am missing some details of how shared memory is accessed. What could explain it?