How to tell if a kernel is memory or compute bound

I’m not sure if this was discussed before; the forum search didn’t yield any results. I’ve found a fairly simple, almost idiotically easy way to tell whether a kernel is compute-bound or memory-bound.

This works on Windows; I’m not sure whether similar tools are available on Linux.

By installing NVIDIA System Tools, one gets access to GPU *cough* underclocking.

Measure the kernel execution time or performance under the following scenarios:

  1. Default memory and shader clocks
  2. Default memory and lowered shader clocks
  3. Lowered memory and default shader clocks

If the kernel performance drops when the shader clock is lowered, then the kernel is compute bound, and vice-versa.

For example, on my 9800GT I lowered the memory clock from 950MHz to 273MHz and the kernel performance was identical in both cases, yet any change to the shader clock caused a proportional change in kernel performance.

Of course, there is the possibility that both the memory and shader clock changes will cause a reduction in kernel performance, in which case the kernel is “balanced”.
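
For reference, here is a minimal timing sketch for the measurement step, to be run once at each clock setting. The kernel myKernel and its buffers are hypothetical placeholders, not part of the original post; substitute the kernel you actually want to classify.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel; replace with the kernel under test.
    __global__ void myKernel(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f;
    }

    // Returns the elapsed GPU time of one launch, in milliseconds.
    float timeKernelMs(float *dOut, const float *dIn, int n)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        myKernel<<<(n + 255) / 256, 256>>>(dOut, dIn, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);          // wait until the kernel has finished

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

If you know the kernel’s FLOP count, dividing it by this time at each clock setting gives a performance figure you can compare directly.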

I do the same thing on Linux all the time. Search for NVIDIA coolbits to see how to enable it. However, it doesn’t seem to allow you to change the clocks on Tesla cards.
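
For anyone who wants to try it, here is a typical xorg.conf snippet; this is an assumption based on older drivers, where bit 0 of Coolbits unlocks the clock controls in nvidia-settings, and the exact bit values vary by driver version:

    Section "Device"
        Identifier "Device0"
        Driver     "nvidia"
        # Bit 0 unlocks the clock frequency controls in nvidia-settings
        Option     "Coolbits" "1"
    EndSection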

Thanks for the hint.

As far as Tesla cards are concerned, I’m sure you can reproduce the results on a GeForce card of the same compute capability.

Interesting. I want to know how you measure the time for the memory clock and shader clock runs. Specifically, when we use cudaEventRecord(), are we measuring only the processing time?

And how do you change shader clock settings?

The kernel execution time is the processing time, and it is the only time you need; it is the variation of this time that is of interest. If you know the FLOP count of your kernel and divide it by the time to get the performance, even better.

Personally, I prefer using QueryPerformanceCounter() (Windows) and gettimeofday() (Linux) for getting the timing information.
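
For completeness, a minimal Linux sketch of that approach, reusing the hypothetical myKernel from the sketch above. The synchronize call matters because kernel launches are asynchronous; without it a host timer only measures the launch overhead.

    #include <sys/time.h>
    #include <cstdio>
    #include <cuda_runtime.h>

    // Wall-clock time in seconds, from gettimeofday().
    static double wallTimeSec()
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    void timeKernelHost(float *dOut, const float *dIn, int n)
    {
        double t0 = wallTimeSec();
        myKernel<<<(n + 255) / 256, 256>>>(dOut, dIn, n);
        cudaThreadSynchronize();   // cudaDeviceSynchronize() in newer toolkits
        double t1 = wallTimeSec();
        printf("kernel time: %.3f ms\n", (t1 - t0) * 1000.0);
    }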

Windows - try Nvidia Control Panel

Linux - haven’t tried it yet

Nice :)

I usually replace the memory accesses in the main loop (which obviously takes the most time) with dummy calculations and compare the timings.

i.e., instead of this:

    float fResult = gData[ threadIdx.x ];
    gOutput[ threadIdx.x ] += fResult;

I do this:

    float fResult = blockIdx.x;
    gOutput[ threadIdx.x ] += fResult;

Not optimal, but it works great… :)

Also, since I mainly use textures, the Visual Profiler is (as far as I remember) useless when it comes to counters for texture accesses, at least up to 3.0.

eyal

Both methods are very interesting.

Just to verify my understanding of the above: if the execution time increases, it indicates that the kernel is compute limited, whereas if it decreases it means the kernel is bound by memory accesses. Am I right?

Yes :) Since most kernels are memory bound in the first place, you’ll probably see a much faster kernel.

Judging from my experience, it’s the fastest and easiest method to fine-tune the code. You get a feeling for where your kernel wastes time, and then you can try to optimize that part instead of wasting time on non-crucial code.

You do need to take extra care not to let the dead-code optimizer remove your code - otherwise your tests will not be valid.

eyal
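
To illustrate that caveat, a minimal sketch of one way to keep the compiler from eliminating the dummy work; the kernel and its loop are hypothetical, and only the gData/gOutput names come from the example above. As long as the dummy result still feeds the write to global memory, the loop cannot be removed as dead code.

    __global__ void dummyLoadTest(float *gOutput, const float *gData, int n)
    {
        float fResult = 0.0f;
        for (int i = 0; i < n; ++i)
        {
            // Original, memory-bound version of the loop body:
            //     fResult += gData[ threadIdx.x + i ];
            // Dummy replacement with comparable arithmetic but no global load:
            fResult += (float)(blockIdx.x + i);
        }
        // fResult is still written to global memory, so the compiler
        // cannot prove the loop above is dead and optimize it away.
        gOutput[ threadIdx.x ] += fResult;
    }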

Thanks. I found your method cleaner than changing the clock frequency every now and then.