Most of the statistics are reported as 0 in the Visual Profiler when the Tesla C1060 is the device used.
WHY???
I need this information to determine why a $1200 Tesla C1060 card has the same performance as a $70 cheapie GT 220.
Unfortunately I can’t determine the bottleneck in performance.
Here are two screenshots comparing the Visual Profiler output for the Tesla and for the GT cards. Notice all the 0’s for the Tesla and all the non-zeros for the GT 220.
Probably because your kernel launch is too small. The profiler collects data by instrumenting a few of the multiprocessors on a device (usually one to three), and then scaling that sample up to approximate the whole GPU. So if you don’t launch enough blocks to cover every multiprocessor on the GPU, there is no guarantee that you will get reliable profiler statistics. The GT 220 gives data while the C1060 doesn’t because the GT 220 has only 6 multiprocessors, whereas the C1060 has 30.
I don’t think it has to do with inactive MPs in my kernel. The Profiler reports zeros for my Tesla C1060 device, regardless of the kernel size.
Has anybody else seen this?
By launching at least M*N blocks, where M is the number of blocks that will run concurrently on each MP (you can use the cubin and the occupancy calculator spreadsheet, or the formulas in the programming guide, to work this out), and N is the number of multiprocessors, which is 30 for your C1060.
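As a rough illustration of the M*N rule, here is a minimal Python sketch that estimates M from per-multiprocessor resource limits and multiplies it by N. The limits are the compute capability 1.3 values (8 resident blocks, 1024 resident threads, 16384 registers and 16 KB of shared memory per MP). The function names and the example kernel figures are hypothetical, and the estimate ignores register and shared memory allocation granularity, so treat the result as an upper bound on M and prefer the occupancy calculator for exact numbers.

```python
# Simplified per-multiprocessor limits for compute capability 1.3
# (the Tesla C1060). Allocation granularity is deliberately ignored.
CC13_LIMITS = {
    "max_blocks_per_mp": 8,
    "max_threads_per_mp": 1024,
    "registers_per_mp": 16384,
    "shared_mem_per_mp": 16384,  # bytes
}


def blocks_per_mp(threads_per_block, regs_per_thread, smem_per_block,
                  limits=CC13_LIMITS):
    """Estimate M: how many blocks can be resident on one multiprocessor."""
    candidates = [
        limits["max_blocks_per_mp"],
        limits["max_threads_per_mp"] // threads_per_block,
    ]
    if regs_per_thread > 0:
        regs_per_block = regs_per_thread * threads_per_block
        candidates.append(limits["registers_per_mp"] // regs_per_block)
    if smem_per_block > 0:
        candidates.append(limits["shared_mem_per_mp"] // smem_per_block)
    # The scarcest resource decides how many blocks fit.
    return min(candidates)


def min_blocks_for_profiling(threads_per_block, regs_per_thread,
                             smem_per_block, num_mps):
    """Return M * N, the grid size needed to occupy every multiprocessor."""
    m = blocks_per_mp(threads_per_block, regs_per_thread, smem_per_block)
    return m * num_mps


if __name__ == "__main__":
    # Hypothetical kernel: 256 threads/block, 16 registers/thread,
    # 4096 bytes of shared memory per block, on a 30-MP C1060.
    print(min_blocks_for_profiling(256, 16, 4096, num_mps=30))
```

The register count and shared memory usage per block come from the cubin. For the hypothetical kernel above, every limit allows four resident blocks per MP, so a grid of fewer than 120 blocks would leave some of the 30 multiprocessors partly idle.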
You don’t. There isn’t presently anything exposed by the CUDA API that can show that level of detail (a top-style utility for the GPU has been much requested, but nothing has appeared thus far). I am sure your problem must be more prosaic than that, though. If your code isn’t collecting statistics, this usually means one of three things:
1. The code doesn’t contain enough work to reliably cover the instrumented CTA
2. The kernels are launching, but not running
3. A mismatch between driver and toolkit/profiler versions means statistics aren’t being collected
I think (1) can be ruled out, but what about (2) and (3)? What OS, toolkit and driver version are you using?