Visual Profiler and Tesla C1060: where are the statistics???

Most of the statistics are reported as 0 in the Visual Profiler when the Tesla C1060 is the device used.

I need this information to determine why a $1200 Tesla C1060 card has the same performance as a $70 cheapie GT 220.
Without the counters, I can't determine where the performance bottleneck is.

Here are two screenshots comparing the Visual Profiler output for the Tesla and for the GT cards. Notice all the 0’s for the Tesla and all the non-zeros for the GT 220.

Probably because your kernel launch is too small. The profiler collects data by instrumenting a few of the multiprocessors on a device (usually one to three), then scaling that sample up to approximate the whole GPU. So if you don't launch enough blocks to cover every multiprocessor in the GPU, there is no guarantee that you will get reliable profiler statistics. Your GT220 gives data while the C1060 doesn't because the GT220 only has 6 MPs, whereas the C1060 has 30.

The block size is 16 x 16.
The grid size is 64 x 1024.

That should be more than enough to make the kernel launch “large”, right?

I don’t think it has to do with inactive MPs in my kernel. The Profiler reports zeros for my Tesla C1060 device regardless of the kernel size.
Has anybody else seen this?

Are you compiling the executable for compute 1.3 architecture?

I compiled and profiled both 1.2 and 1.3 architectures. They have different execution times.

Either way, statistics are “0” for the Tesla device.

How can I be sure that all MPs are being used??

By launching at least M*N blocks, where M is the number of resident blocks per MP (you can get this from the cubin plus the occupancy calculator spreadsheet, or from the formulas in the programming guide) and N is the number of multiprocessors, which is 30 for your C1060.

The number of blocks I’m launching is much, much greater than M*N. My question is:

How do I know the real-time scheduler/launcher is actually making use of all multi-processors?

P.S. thanks avidday. You are always a lifesaver.

You don’t. There isn’t presently anything exposed by the CUDA API that can show that level of detail (a GPU “top”-style utility has been much requested, but nothing has appeared thus far). I am sure your problem is more prosaic than that, though. If your code isn’t collecting statistics, it usually means one of three things:


1. The code doesn’t contain enough work to reliably cover the instrumented CTAs

2. The kernels are launching, but not running

3. A mismatch between driver and toolkit/profiler versions means statistics aren’t being collected

I think (1) can be ruled out, but what about (2) and (3)? What OS, toolkit and driver version are you using?

I’m ruling (1) out for the reasons already mentioned.

As for (2): the kernel output is correct, so the kernel is definitely running. (Whether all MPs are utilized is another matter.)

And (3): the GT 220 on the same system reports stats just fine, using the same toolkit and driver as the Tesla C1060.

Anyway, here are the details requested:

OS: Windows 7 Ultimate, 64-bit

Toolkit: CUDA 2.3

NVIDIA driver: 190.38

Try using the latest toolkit, 3.0, with the latest drivers. That should solve your problem.