Memory workload analysis

I am using Nsight Compute to optimize the performance of my CUDA code. In the report, the message below was identified as one of the areas for performance improvement.

“The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To further improve performance, work will likely need to be shifted from the most utilized to another unit. Start by analyzing workloads in the Memory Workload Analysis section.”

Here’s the memory workload analysis:

I am attempting to interpret the results. Is the report suggesting that I move some of the data to local or shared memory to help improve performance?

The bright orange line representing the read path from device memory is the point of focus. The chart is color coded, and bright orange indicates a measurement in the ~80%-of-theoretical-peak range.

So your code is memory bound. This is not surprising. If your algorithm is inherently memory bound (e.g. vector add, sketched below), there is likely not much you can do about it.
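For reference, here is a minimal sketch of what "inherently memory bound" means: every element is loaded once, used in a single add, and stored once, so DRAM bandwidth, not compute throughput, sets the ceiling.

```cuda
// Minimal memory-bound kernel: roughly one FLOP per 12 bytes of
// memory traffic (two 4-byte loads plus one 4-byte store), so the
// memory system saturates long before the compute units do.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
```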

There are nevertheless some general suggestions. The two most common that come to mind are:

  • make sure your global loads are coalesced (see the first sketch after this list)
  • take advantage of all the caches in the architecture, as well as shared memory, to cache reused data (see the second sketch)
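For the first point, a minimal sketch contrasting a coalesced and a strided global-load pattern (the kernel names and the stride parameter are illustrative, not from your report):

```cuda
// Adjacent threads reading adjacent addresses: the hardware can
// service a warp's 32 loads with a few wide memory transactions.
__global__ void coalescedRead(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // thread k touches element k: coalesced
}

// Adjacent threads separated by `stride` elements: each warp load
// scatters across many memory segments, wasting bandwidth.
__global__ void stridedRead(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / stride)
        out[i] = in[i * stride];  // poorly coalesced for stride > 1
}
```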
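For the second point, a sketch of staging reused data through shared memory, using a simple 1D stencil as a stand-in (the kernel name, block size, and radius are assumptions for illustration):

```cuda
#define RADIUS 3
#define BLOCK  256  // must match blockDim.x at launch

// Each input element is fetched from global memory once per block,
// then reused by up to 2*RADIUS+1 threads out of shared memory,
// cutting global read traffic for data that is touched repeatedly.
__global__ void stencil1D(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index into the tile

    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {
        // Threads at the block edge also load the halo elements.
        tile[l - RADIUS] = (g >= RADIUS)   ? in[g - RADIUS] : 0.0f;
        tile[l + BLOCK]  = (g + BLOCK < n) ? in[g + BLOCK]  : 0.0f;
    }
    __syncthreads();

    if (g < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[l + k];  // reads hit shared memory, not DRAM
        out[g] = sum;
    }
}
```

A launch like `stencil1D<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);` would exercise it. Note that this only helps when data is actually reused; for a pure streaming kernel like vector add, there is nothing to cache.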