How to tell if a kernel is memory or compute bound

I’m not sure if this was discussed before; the forum search didn’t yield any results. I’ve found a fairly simple, almost idiotically easy way to tell whether a kernel is compute-bound or memory-bound.

This works on Windows; I’m not sure whether similar tools are available on Linux.

By installing NVIDIA System Tools, one gets access to GPU *cough* underclocking.

Measure the kernel execution time or performance under the following scenarios:

  1. Default memory and shader clocks
  2. Default memory and lowered shader clocks
  3. Lowered memory and default shader clocks

If the kernel performance drops when the shader clock is lowered, then the kernel is compute bound, and vice-versa.

For example, on my 9800GT I lowered the memory clock from 950MHz to 273MHz and the kernel performance was identical in both cases, yet any change to the shader clock caused a proportional change in kernel performance.

Of course, there is the possibility that both the memory and shader clock changes will cause a reduction in kernel performance, in which case the kernel is “balanced”.
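
For reference, here is a minimal timing sketch for the measurement step, to be run once at each clock setting. The kernel myKernel and its buffers are hypothetical placeholders, not part of the original post; substitute the kernel you actually want to classify.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel; replace with the kernel under test.
    __global__ void myKernel(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f;
    }

    // Returns the elapsed GPU time of one launch, in milliseconds.
    float timeKernelMs(float *dOut, const float *dIn, int n)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        myKernel<<<(n + 255) / 256, 256>>>(dOut, dIn, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);          // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

If you know the kernel’s FLOP count, dividing it by this time at each clock setting gives a performance figure you can compare directly.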

I do the same thing on Linux all the time. Search for NVIDIA coolbits to see how to enable it. However, it doesn’t seem to allow you to change the clocks on Tesla cards.
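
For anyone who wants to try it, here is a typical xorg.conf snippet; this is an assumption based on older drivers, where bit 0 of Coolbits unlocks the clock controls in nvidia-settings, and the exact bit values vary by driver version:

    Section "Device"
        Identifier "Device0"
        Driver     "nvidia"
        # Bit 0 unlocks the clock frequency controls in nvidia-settings
        Option     "Coolbits" "1"
    EndSection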

Thanks for the hint.

As far as Tesla cards are concerned, I’m sure you can reproduce the results on a GeForce card of the same compute capability.

Interesting. I want to know how you measure the time for the memory clock and shader clock runs. Specifically, when we use cudaEventRecord(), are we measuring only the processing time?

And how do you change shader clock settings?

The kernel execution time is the processing time, and it is the only time you need; it is the variation of this time that is of interest. If you know the FLOP count of your kernel and divide it by the time to get the performance, even better.

Personally, I prefer using QueryPerformanceCounter() (Windows) and gettimeofday() (Linux) for getting the timing information.
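
For completeness, a minimal Linux sketch of that approach, reusing the hypothetical myKernel from the sketch above. The synchronize call matters because kernel launches are asynchronous; without it a host timer only measures the launch overhead.

    #include <sys/time.h>
    #include <cstdio>
    #include <cuda_runtime.h>

    // Wall-clock time in seconds, from gettimeofday().
    static double wallTimeSec()
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    void timeKernelHost(float *dOut, const float *dIn, int n)
    {
        double t0 = wallTimeSec();
        myKernel<<<(n + 255) / 256, 256>>>(dOut, dIn, n);
        cudaThreadSynchronize();   // cudaDeviceSynchronize() in newer toolkits
        double t1 = wallTimeSec();
        printf("kernel time: %.3f ms\n", (t1 - t0) * 1000.0);
    }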

Windows - try Nvidia Control Panel

Linux - haven’t tried it yet

Nice :)

I usually replace the memory accesses in the main loop (which obviously takes the most time) with dummy calculations and compare the timings.

i.e., instead of this:

    float fResult = gData[ threadIdx.x ];
    gOutput[ threadIdx.x ] += fResult;

I do this:

    float fResult = blockIdx.x;
    gOutput[ threadIdx.x ] += fResult;

Not optimal, but it works great… :)

Also, since I mainly use textures, the Visual Profiler is (as far as I remember) useless when it comes to counters for texture accesses, at least up to 3.0.

eyal

Both methods are very interesting.

Just to verify my understanding of the above: if the execution time increases, it indicates that the kernel is compute limited, whereas if it decreases it means the kernel is bound by memory accesses. Am I right?

Yes :) Since most kernels are memory bound in the first place, you’ll probably see a much faster kernel.

Judging from my experience, it’s the fastest and easiest method to fine-tune the code. You get a feeling for where your kernel wastes time, and then you can try to optimize that part instead of wasting time on non-crucial code.

You do need to take extra care not to let the dead-code optimizer remove your code - otherwise your tests will not be valid.

eyal
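
To illustrate that caveat, a minimal sketch of one way to keep the compiler from eliminating the dummy work; the kernel and its loop are hypothetical, and only the gData/gOutput names come from the example above. As long as the dummy result still feeds the write to global memory, the loop cannot be removed as dead code.

    __global__ void dummyLoadTest(float *gOutput, const float *gData, int n)
    {
        float fResult = 0.0f;
        for (int i = 0; i < n; ++i)
        {
            // Original, memory-bound version of the loop body:
            //     fResult += gData[ threadIdx.x + i ];
            // Dummy replacement with comparable arithmetic but no global load:
            fResult += (float)(blockIdx.x + i);
        }
        // fResult is still written to global memory, so the compiler
        // cannot prove the loop above is dead and optimize it away.
        gOutput[ threadIdx.x ] += fResult;
    }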

Thanks. I found your method cleaner than changing the clock frequency every now and then.