Is there any tool which can tell me whether my kernel is compute bound or memory bound?

A tool which works for any CUDA program and tells me whether it is memory bound or compute bound.

Thanks for the help

Not as such, as far as I know. You have the profiler, which will tell you how effective your loads/stores are (coalescing, cache misses, etc.), and Nexus, which will tell you how hard the GPU is working.

You can also comment out the compute part of the kernel to see how much time the memory traffic alone takes.
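A minimal sketch of that idea, with a hypothetical element-wise kernel: the arithmetic is compiled out with a preprocessor switch, but the loaded value is still written back so the compiler cannot remove the loads. Build once with nvcc as-is and once with -DMEMORY_ONLY and compare the times.

[indent][font=“Courier New”]
#include <cuda_runtime.h>

// Hypothetical kernel: define MEMORY_ONLY to strip the arithmetic and keep
// only the global loads/stores, so the runtime approximates memory traffic alone.
__global__ void scale_kernel(const float* x, float* y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
#ifndef MEMORY_ONLY
        // the "compute stuff" -- replace with your kernel's real math
        for (int k = 0; k < 64; ++k)
            v = a * v + 0.5f;
#endif
        y[i] = v;   // storing v keeps the load from being optimised away
    }
}

int main()
{
    const int n = 1 << 22;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    scale_kernel<<<(n + 255) / 256, 256>>>(d_x, d_y, 1.001f, n);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
[/font][/indent]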

The usual approach is to estimate how much memory traffic the kernel generates relative to the card's peak bandwidth, and how much arithmetic it does relative to peak GFLOPS, and then see which of the two fractions is higher.
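A back-of-envelope sketch of that estimate; the peak numbers and per-element counts here are placeholders you would fill in for your own card and kernel:

[indent][font=“Courier New”]
#include <cstdio>

// Rough bound check: compare the time the memory traffic alone would take at
// peak bandwidth with the time the math alone would take at peak GFLOPS.
// The larger of the two is the likely limiter.
int main()
{
    const double peak_bw_GBs   = 100.0;          // placeholder: your card's peak bandwidth
    const double peak_gflops   = 900.0;          // placeholder: your card's peak GFLOP/s
    const double bytes_moved   = 3.0 * 4 * 1e7;  // e.g. 3 floats per element, 1e7 elements
    const double flops_done    = 20.0 * 1e7;     // e.g. 20 FLOPs per element

    const double t_mem  = bytes_moved / (peak_bw_GBs * 1e9);
    const double t_math = flops_done  / (peak_gflops * 1e9);

    printf("memory-limited time  : %g s\n", t_mem);
    printf("compute-limited time : %g s\n", t_math);
    printf("kernel looks %s bound\n", t_mem > t_math ? "memory" : "compute");
    return 0;
}
[/font][/indent]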

If you can find a tool that lets you adjust the clock rates of your card, then I’ve seen people mention a neat trick to test this. Benchmark your code, then turn the core and shader clock down 15% and rerun the benchmark. Similarly, put the core and shader clock back to nominal, turn down the memory clock by 15%, and run once more. By comparing the runtimes of the three tests, you should get a good sense of whether your code is compute or memory bound: whichever clock reduction slows your code down roughly in proportion is the resource you are bound by.
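For the benchmarking step itself, a cudaEvent-based timing harness along these lines (the kernel and launch configuration are placeholders) gives consistent numbers across the three clock settings:

[indent][font=“Courier New”]
// Minimal kernel timing harness using CUDA events; run it once at nominal
// clocks, once with the memory clock reduced, and once with core+shader
// reduced, and compare the three times.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(float* data, int n)      // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 22;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);   // warm-up launch
    cudaEventRecord(start);
    my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
[/font][/indent]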

The aggregate memory throughput counters in the Visual Profiler are pretty useful too (although the ones in the Linux 3.0 release of the profiler are broken, at least for my compute 1.3 hardware). You can see the read and write throughput and compare it to the specs for your card. On all the hardware I have tried, it is usually possible to hit about 90% of the theoretical bandwidth.
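If you prefer to compute the number yourself rather than read it off the profiler, the effective bandwidth is just the bytes read plus bytes written divided by the kernel time; the element counts, measured time, and spec-sheet peak below are placeholders:

[indent][font=“Courier New”]
#include <cstdio>

// Effective bandwidth from a measured kernel time: bytes read plus bytes
// written, divided by the elapsed time. Compare against the spec-sheet peak.
int main()
{
    const double n             = 1e7;         // elements processed (placeholder)
    const double bytes_read    = n * 2 * 4;   // e.g. two float loads per element
    const double bytes_written = n * 1 * 4;   // e.g. one float store per element
    const double kernel_ms     = 1.8;         // measured with a timing harness
    const double peak_GBs      = 70.4;        // placeholder: spec-sheet bandwidth

    const double eff_GBs = (bytes_read + bytes_written) / (kernel_ms * 1e-3) / 1e9;
    printf("effective bandwidth: %.1f GB/s (%.0f%% of peak)\n",
           eff_GBs, 100.0 * eff_GBs / peak_GBs);
    return 0;
}
[/font][/indent]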

Great tip! This one goes into my list.

You could probably even automate this with a launcher program that uses NVAPI.

Something like:

[indent][font=“Courier New”]

nvscale --auto cuda-program

nvscale --memory -15 cuda-program

nvscale --core+shader -15 cuda-program

[/font][/indent]

Where [font=“Courier New”]--auto[/font] runs the program three times, scaling the memory and core/shader clocks just as you describe.
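A rough sketch of what such a launcher's timing loop could look like; the clock-scaling calls themselves are platform-specific (NVAPI on Windows, nvidia-settings on Linux) and are only marked as placeholders here, and [font=“Courier New”]nvscale[/font] is of course a hypothetical tool name:

[indent][font=“Courier New”]
// Skeleton of a launcher that runs the target program three times and
// reports wall-clock times; the actual clock changes are left as TODOs.
#include <chrono>
#include <cstdio>
#include <cstdlib>

static double timed_run(const char* cmd)
{
    auto t0 = std::chrono::steady_clock::now();
    std::system(cmd);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main(int argc, char** argv)
{
    if (argc < 2) { std::printf("usage: nvscale <cuda-program>\n"); return 1; }

    // TODO: set nominal clocks here
    double t_nominal = timed_run(argv[1]);

    // TODO: lower the memory clock by ~15% here
    double t_mem = timed_run(argv[1]);

    // TODO: restore memory clock, lower core+shader clocks by ~15% here
    double t_core = timed_run(argv[1]);

    std::printf("nominal: %.2f s, -15%% memory: %.2f s, -15%% core+shader: %.2f s\n",
                t_nominal, t_mem, t_core);
    return 0;
}
[/font][/indent]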

Good one, Thanks

This is a much simpler method to test the program, thank you.

Down-clocking sounds like an awesome idea indeed. I’m on Linux and haven’t done any under/over-clocking before. One possible approach seems to be enabling the corresponding controls in the nvidia-settings application by turning the “Coolbits” option on in my xorg.conf file - is this the right tool to use under Linux? Also, since I can only change the core clock this way: does the “shader” clock scale with the core clock? (On my Quadro FX 770m the core clock is 500 MHz and the “shader” clock is 1.25 GHz, so if I decrease the core clock by 15%, does the “shader” clock drop by 15% too?) Further: are any changes needed under the “PowerMizer” section? I have “adaptive” mode turned on there, and I noticed that as soon as I start any kind of CUDA application the clocks go up to their maximum values - but would it still be worth changing the preferred mode to “prefer maximum performance” while doing the measurements, just to be sure?
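For reference, the Coolbits change described above goes in the Device section of xorg.conf; the exact value required depends on the driver version (older drivers used “1” to expose the clock controls in nvidia-settings), so treat this as an illustrative sketch:

[indent][font=“Courier New”]
# /etc/X11/xorg.conf (Device section) -- exposes the clock controls in
# nvidia-settings; the Coolbits value required varies with driver version.
Section "Device"
    Identifier "NVIDIA GPU"
    Driver     "nvidia"
    Option     "Coolbits" "1"
EndSection
[/font][/indent]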