Compute Visual Profiler - global memory throughput

Hi there,

I’m currently trying to optimise my CUDA Fortran code to run like the wind, so to assist me I’ve drafted in the help of the Compute Visual Profiler. So far I’ve been very impressed with it but I need a bit more detail as to what the global memory throughput metric truly represents.

At the moment one of the kernels in my code runs with an overall global memory throughput of 1.8 GB/s (a little disappointing). As it happens, this is the slowest kernel in the code, so I think the global memory loads/stores are what is limiting performance. But what exactly does this metric represent?

This particular kernel has only a few global variable accesses but many more local variable accesses. As I understand it, local variables are thread-private variables that physically live in global memory. So, does my 1.8 GB/s global memory throughput represent solely the accesses to global variables, or does it also include accesses to local variables that have spilled over from the registers?
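
To make it a bit more concrete, here's a stripped-down sketch of the kind of kernel I'm talking about (the names and sizes are made up, it's not my actual code). The scalar should sit in a register, while the per-thread array may end up in local memory, which as far as I know physically resides in device (global) memory:

```fortran
module demo_mod
  use cudafor
contains
  attributes(global) subroutine demo(a, n)
    integer, value  :: n
    real(8), device :: a(n)
    real(8) :: scale       ! scalar local variable: normally kept in a register
    real(8) :: work(32)    ! per-thread array: may be placed in local memory,
                           ! which physically sits in device (global) memory
    integer :: i, j
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i > n) return
    scale = 2.0d0
    do j = 1, 32
       work(j) = a(i) + scale * j   ! this is the traffic I'm unsure about
    end do
    a(i) = work(32)
  end subroutine demo
end module demo_mod
```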

If anyone has any information about this issue I'd be really grateful, as it will help me decide which variables I should put into shared memory to get the best performance.
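
For context, the sort of change I'm considering looks roughly like this; again the kernel and array names are made up rather than taken from my real code. The idea is one global load per element, with repeated reads served from shared memory:

```fortran
module smem_mod
  use cudafor
contains
  ! Assumes a launch of <<<grid, 256>>>, i.e. 256 threads per block.
  attributes(global) subroutine smooth(a, b, n)
    integer, value  :: n
    real(8), device :: a(n), b(n)
    real(8), shared :: tile(256)   ! one element per thread in the block
    integer :: i, t
    t = threadIdx%x
    i = (blockIdx%x - 1) * blockDim%x + t
    if (i <= n) tile(t) = a(i)     ! one global load per element
    call syncthreads()             ! every thread in the block must reach this
    if (i <= n) then
       ! later reads come out of shared memory instead of global memory
       if (t > 1 .and. t < blockDim%x .and. i < n) then
          b(i) = (tile(t-1) + tile(t) + tile(t+1)) / 3.0d0
       else
          b(i) = tile(t)
       end if
    end if
  end subroutine smooth
end module smem_mod
```

It would be launched from the host as something like call smooth<<<(n+255)/256, 256>>>(a_d, b_d, n).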

Cheers,
Crip_crop

I have had similar experiences in my own code: the bottleneck is global memory access by the GPU cores.

What do you mean by local variables here, local array variables inside kernel or device subroutines? Scalar local variables are automatically promoted to registers, but arrays are not (registers cannot be addressed linearly) at the -O0 and -O1 optimization levels.
At -O2 and -O3 the compiler seems to try to break the arrays down into scalars, but the results are still poor, and sometimes it makes mistakes.

I have had some success with manually splitting small but key arrays into independent scalars so that they go into registers. The performance gain was substantial.
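
Roughly, what I mean is something like this (a made-up example, not my real code): replace the tiny per-thread array with individual scalars that the compiler can keep in registers:

```fortran
module scalarise_mod
  use cudafor
contains
  attributes(global) subroutine rhs(a, b, n)
    integer, value  :: n
    real(8), device :: a(n), b(n)
    ! instead of:  real(8) :: c(3)   (indexed accesses may push it to local memory)
    real(8) :: c1, c2, c3             ! three scalars the compiler can keep in registers
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i > n) return
    c1 = a(i)
    c2 = c1 * c1
    c3 = c2 * c1
    b(i) = c1 + 0.5d0*c2 + c3/6.0d0
  end subroutine rhs
end module scalarise_mod
```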

Using shared memory is a very good idea, but there is not much shared memory per thread either, and there are addressing costs as well.

For my Tesla M2050 (Fermi, cc 2.0):
Shared memory per multiprocessor: 49152 B
Registers per multiprocessor: 32768
Note that the shared memory figure is in bytes, while each register is 32 bits wide.

Which may mean we can hold up to 16K real(8) values in registers per multiprocessor (some of them may be reserved by the code, perhaps not if all functions are inlined) and 6K real(8) values in shared memory for one block. The ideal is to make full use of both, but the compiler seems to have some limitation (strange!!!) on using more registers.
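
To put per-thread numbers on it, assuming 256 threads per block and one block resident per multiprocessor: 32768 registers / 256 threads = 128 registers per thread, i.e. 64 real(8) values, and 49152 B / 256 threads = 192 B = 24 real(8) values of shared memory per thread. Note also that on cc 2.x hardware at most 63 registers can be assigned to a single thread, which may be the strange limitation I mentioned.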

Hope it helps.