Tuning GPU code Profiler output interpretation

spencer · March 7, 2007, 9:06pm

I am curious about the output I got from the profiler when I ran my existing (unoptimized) code.

Is CPU time ~= GPU time a good thing?

My code is probably memory bound, since the video data is copied into global memory (the first memcpy call). Given that I run about 86K threads in 338 blocks of 256 threads, should I expect to be able to do better than an occupancy of 66%?

spencer · March 7, 2007, 10:55pm

To follow up, I experimented with changing the number of threads per block without changing my kernel code. There seem to be two sweet spots, 128 threads per block and 320 threads per block, where the occupancy reaches 83%. I can understand one sweet spot due to optimal register allocation, but two sweet spots?

Any thoughts?

Mark_Harris · March 8, 2007, 1:51pm

CPU Time includes GPU Time, so it should always be greater.

Mark

bbudge · March 25, 2007, 9:08pm

It could be that the second sweet spot is due to running more blocks concurrently. In my experience, this usually happens with thread counts that are multiples of one another (e.g. 128 and 256), but it could still be the case with your program.
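To make the "more blocks concurrently" idea concrete, here is a rough sketch of the occupancy arithmetic. It is not the official occupancy calculator. It assumes the per-multiprocessor limits of a G80-class GPU (768 resident threads, i.e. 24 warps, at most 8 blocks, 8192 registers), ignores shared memory and register allocation granularity, and assumes the kernel uses 12 registers per thread. That register count is not stated anywhere in this thread; it is a hypothetical value chosen because it reproduces the reported occupancies.

```python
# Simplified occupancy model; NOT the official occupancy calculator.
# Assumed per-multiprocessor limits for a G80-class GPU (compute capability 1.0).
WARP_SIZE = 32
MAX_WARPS_PER_SM = 24      # 768 resident threads
MAX_BLOCKS_PER_SM = 8
REGISTERS_PER_SM = 8192


def warps_per_block(threads_per_block):
    """Warps needed for one block (ceiling division)."""
    return -(-threads_per_block // WARP_SIZE)


def resident_blocks(threads_per_block, regs_per_thread):
    """Blocks that fit on one multiprocessor at the same time.

    Shared memory and register allocation granularity are ignored.
    """
    warps = warps_per_block(threads_per_block)
    by_warps = MAX_WARPS_PER_SM // warps
    by_regs = REGISTERS_PER_SM // (warps * WARP_SIZE * regs_per_thread)
    return min(MAX_BLOCKS_PER_SM, by_warps, by_regs)


def occupancy(threads_per_block, regs_per_thread):
    """Resident warps divided by the maximum number of resident warps."""
    blocks = resident_blocks(threads_per_block, regs_per_thread)
    return blocks * warps_per_block(threads_per_block) / MAX_WARPS_PER_SM


if __name__ == "__main__":
    REGS = 12  # hypothetical register count, not taken from the thread
    for threads in (64, 128, 192, 256, 320, 384, 512):
        blocks = resident_blocks(threads, REGS)
        print(threads, blocks, round(occupancy(threads, REGS), 3))
```

Under these assumptions, 128 threads per block fits 5 blocks of 4 warps and 320 threads per block fits 2 blocks of 10 warps. Both give 20 of 24 warps (83%), whereas 256 threads per block fits only 2 blocks of 8 warps (67%). Because whole blocks are the unit of allocation, occupancy is a sawtooth in the block size, so more than one sweet spot is plausible.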

spencer · March 26, 2007, 1:37pm

It could very well be the case, as my kernel is memory bound. I have to store all the video data in global memory because there just isn’t enough shared memory to hold it, even divided across multiprocessors. I try to read the bits I care about into registers when I can.

Experimenting with the occupancy spreadsheet suggests that occupancy is mainly a function of the number of registers used; the amount of shared memory used has much less impact.
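One way to see why registers tend to dominate is to compute, for each resource, how many blocks it would allow on a multiprocessor, and then look at which limit is the smallest. The sketch below does that under assumed G80-class limits (24 warps, 8 blocks, 8192 registers, 16 KB of shared memory per multiprocessor). The register and shared memory figures passed in are hypothetical, not measurements from this kernel, and allocation granularity is ignored.

```python
# Which resource limits the number of resident blocks? A simplified sketch.
# Assumed per-multiprocessor limits for a G80-class GPU (compute capability 1.0).
WARP_SIZE = 32
SM_LIMITS = {"blocks": 8, "warps": 24, "registers": 8192, "shared_bytes": 16384}


def block_limits(threads_per_block, regs_per_thread, shared_bytes_per_block):
    """Resident blocks each resource would allow if it were the only constraint."""
    warps = -(-threads_per_block // WARP_SIZE)  # ceiling division
    limits = {
        "blocks": SM_LIMITS["blocks"],
        "warps": SM_LIMITS["warps"] // warps,
        "registers": SM_LIMITS["registers"] // (warps * WARP_SIZE * regs_per_thread),
    }
    if shared_bytes_per_block > 0:
        limits["shared_bytes"] = SM_LIMITS["shared_bytes"] // shared_bytes_per_block
    return limits


def limiting_resource(threads_per_block, regs_per_thread, shared_bytes_per_block):
    """Return (resource name, resident blocks) for the tightest limit."""
    limits = block_limits(threads_per_block, regs_per_thread, shared_bytes_per_block)
    name = min(limits, key=limits.get)  # on ties, the first listed resource wins
    return name, limits[name]


if __name__ == "__main__":
    # Hypothetical configurations: (threads per block, registers, shared bytes)
    for config in ((256, 10, 4096), (256, 12, 2048), (256, 10, 6144)):
        print(config, limiting_resource(*config))
```

With 256 threads per block in this model, going from 10 to 12 registers per thread already drops the multiprocessor from 3 resident blocks to 2, while shared memory only becomes the tightest limit once a block uses more than about 5.3 KB.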

Spencer

spencer · March 26, 2007, 4:22pm

I benchmarked the current version of my kernel for different numbers of threads per block (16 to 512). Although the occupancy varied from 0.333 to 0.667, the kernel execution time varied by only about 2.5% (1545 to 1584 microseconds).

So in this case, I expect I am limited by the access cost of global memory, which I use plenty of.

Spencer
