I’m not sure if this has been discussed before (the forum search didn’t yield any results), but I’ve found a fairly simple, almost idiotically easy way to tell whether a kernel is compute-bound or memory-bound.
This works on Windows, I’m not sure if similar tools are available on Linux.
By installing NVIDIA System Tools, one has access to GPU *cough* underclocking.
Measure the kernel execution time or performance under the following scenarios:
Default memory and shader clocks
Default memory and lowered shader clocks
Lowered memory and default shader clocks
If the kernel performance drops when the shader clock is lowered, the kernel is compute-bound; if it drops when the memory clock is lowered, it is memory-bound.
For example, on my 9800GT I lowered the memory clock from 950MHz to 273MHz and the kernel performance was identical in both cases, but any change to the shader clock caused a proportional change in kernel performance.
Of course, there is the possibility that both the memory and shader clock changes will reduce kernel performance, in which case the kernel is “balanced”.
I do the same thing on Linux all the time. Search for NVIDIA coolbits to see how to enable it. However, it doesn’t seem to allow you to change the clocks on Tesla cards.
Interesting. I’d like to know how you measure the time for the memory clock and shader clock tests. Specifically, when we use cudaEventRecord(), are we measuring only the processing time?
The kernel execution time is the processing time, and it is the only time you need; it is the variation of this time that is of interest. If you know the FLOP count of your kernel and divide it by the time to get performance, even better.
Personally, I prefer using QueryPerformanceCounter() (Windows) and gettimeofday() (Linux) for getting the timing information.
Just to verify my understanding of the above: if lowering the shader clock increases the execution time, the kernel is compute-bound, whereas if lowering the memory clock increases it, the kernel is bound by memory accesses. Am I right?