Compute Visual Profiler - global memory throughput

Hi there,

I’m currently trying to optimise my CUDA Fortran code to run like the wind, so to assist me I’ve drafted in the help of the Compute Visual Profiler. So far I’ve been very impressed with it but I need a bit more detail as to what the global memory throughput metric truly represents.

At the moment one of the kernels in my code runs with an overall global memory throughput of 1.8 GB/s (a little disappointing). As it happens, this is the slowest kernel in my code, so I suspect the global memory loads/stores are what is limiting performance. But what exactly does this metric represent?

This particular kernel has only a few global variable accesses but many more local variable accesses. As I understand it, local variables are thread-private, but local memory physically resides in global memory. So does my 1.8 GB/s global memory throughput represent solely accesses to global variables, or does it also include accesses to local variables that have spilled out of registers?
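To make the question concrete, here is a minimal sketch (the kernel and variable names are made up) of the kind of thread-private storage I mean. The scalar would normally stay in a register, while the small per-thread array may be placed in local memory, which physically lives in global memory:

```fortran
attributes(global) subroutine demo_kernel(a, n)
  real(8), device :: a(n)
  integer, value :: n
  real(8) :: tmp(16)  ! thread-private array: may be spilled to local memory
  real(8) :: s        ! thread-private scalar: normally kept in a register
  integer :: i, idx

  idx = (blockIdx%x - 1) * blockDim%x + threadIdx%x
  if (idx > n) return

  s = a(idx)
  do i = 1, 16
     tmp(i) = s * i   ! each access here may touch local (i.e. global) memory
  end do

  s = 0.0d0
  do i = 1, 16
     s = s + tmp(i)
  end do
  a(idx) = s
end subroutine demo_kernel
```

If accesses to `tmp` are counted in the throughput metric, that would explain a figure dominated by something other than my explicit global arrays.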

If anyone has any information about this I'd be really grateful, as it will help me decide which variables I should put into shared memory to get the best performance.


I have had similar experiences in my own code: the bottleneck is global memory access by the GPU cores.

What do you mean by "local variables" here? Local array variables inside kernel or device subroutines? Local scalar variables are automatically promoted to registers, but arrays are not (registers cannot be addressed linearly) at the O0 and O1 optimisation levels.
At O2 and O3 the compiler seems to try to decompose the arrays into scalars, but the result is still poor, and sometimes it makes mistakes.

I have had some success manually moving small but key arrays into individual register variables; the performance gain was substantial.
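As a sketch of what I mean (illustrative names, not code from my application): the same small computation written once with a thread-private array, which may be spilled to local memory, and once with individual scalars, which the compiler can keep in registers:

```fortran
! Version with a small local array: the array may live in local memory.
attributes(device) real(8) function sumsq_array(x, y, z)
  real(8), value :: x, y, z
  real(8) :: c(3)
  integer :: i
  c(1) = x;  c(2) = y;  c(3) = z
  sumsq_array = 0.0d0
  do i = 1, 3
     sumsq_array = sumsq_array + c(i) * c(i)
  end do
end function sumsq_array

! Same computation with scalars: each can be held in a register.
attributes(device) real(8) function sumsq_scalar(x, y, z)
  real(8), value :: x, y, z
  real(8) :: c1, c2, c3
  c1 = x;  c2 = y;  c3 = z
  sumsq_scalar = c1*c1 + c2*c2 + c3*c3
end function sumsq_scalar
```

This hand-scalarisation only pays off for small arrays with fixed, compile-time-known indexing; once indices are computed at run time the compiler has to fall back to addressable (local) memory anyway.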

Using shared memory is a very good idea, but there is not that much shared memory per thread either, and shared-memory accesses have their own addressing costs.
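For reference, a hedged sketch of the usual pattern (names are illustrative): stage a tile of a global array into a `shared` array once, then operate on the tile instead of re-reading global memory:

```fortran
attributes(global) subroutine smem_demo(a, n)
  real(8), device :: a(n)
  integer, value :: n
  ! One 256-element real(8) tile = 2 KB of the per-multiprocessor budget.
  real(8), shared :: tile(256)
  integer :: tid, i

  tid = threadIdx%x
  i   = (blockIdx%x - 1) * blockDim%x + tid

  if (i <= n) tile(tid) = a(i)   ! one global load per element
  call syncthreads()

  ! ... repeated work would read tile(...) here instead of a(...) ...

  if (i <= n) a(i) = tile(tid)   ! one global store per element
end subroutine smem_demo
```

Note the tile size is a per-block cost: with 48 KB of shared memory per multiprocessor, large tiles limit how many blocks can be resident at once.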

For my Tesla M2050 (Fermi, cc 2.0):
Shared Memory per Multiprocessor (B): 49152
Number of Registers per Multiprocessor: 32768
Note that the shared-memory figure is in bytes, while each register is 32 bits (4 bytes).

Which means we can hold at most 16K real(8) values in registers per multiprocessor (32768 registers x 4 B = 128 KB; some of them may be reserved by the generated code, perhaps none if all functions are inlined), and 6K real(8) values in shared memory per block (49152 B / 8 B = 6144). Ideally you would use both to the full, but the compiler seems to have some limitation (strange!) on using more registers.

Hope it helps.