I’m currently trying to optimise my CUDA Fortran code to run like the wind, so to assist me I’ve drafted in the help of the Compute Visual Profiler. So far I’ve been very impressed with it, but I need a bit more detail on what the global memory throughput metric actually represents.
At the moment, one of my kernels runs with an overall global memory throughput of 1.8 GB/s (a little disappointing). As it happens, this is the slowest kernel in my code, so I suspect that global memory loads and stores are what is limiting performance. But what exactly does this metric measure?
This particular kernel has only a few global variable accesses but many more local variable accesses. As I understand it, local variables that do not fit in registers are placed in local memory, which is thread-private but physically resides in global memory. So, does my 1.8 GB/s global memory throughput represent solely accesses to global variables, or does it also include accesses to local variables that have spilled out of the registers?
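For reference, here is how I would cross-check the profiler figure by hand. It is only a rough sketch using the usual effective-bandwidth formula (bytes read plus bytes written, divided by kernel time), counting only the global variable accesses I can see in the source. All the numbers below are made-up placeholders, not measurements from my kernel.

```python
def effective_bandwidth_gbs(bytes_read, bytes_written, seconds):
    """Effective bandwidth in GB/s, taking 1 GB as 1e9 bytes."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (bytes_read + bytes_written) / 1e9 / seconds


# Hypothetical figures for illustration only (assumptions, not profiler data).
n_threads = 1_000_000      # total threads launched
bytes_per_real = 4         # single-precision real
loads_per_thread = 3       # global loads visible in the kernel source
stores_per_thread = 1      # global stores visible in the kernel source
kernel_time_s = 2.0e-3     # kernel time as reported by the profiler

bytes_read = n_threads * loads_per_thread * bytes_per_real
bytes_written = n_threads * stores_per_thread * bytes_per_real

bw = effective_bandwidth_gbs(bytes_read, bytes_written, kernel_time_s)
print(f"hand-counted effective bandwidth: {bw:.2f} GB/s")
```

If the profiler's number turned out to be noticeably higher than a hand count like this, that might hint that spilled local variables are being counted as well, although I assume a gap could equally come from uncoalesced accesses moving more bytes than the kernel actually uses.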
If anyone has any information about this, I’d be really grateful, as it will help me decide which variables I should move into shared memory to get the best performance.