Visual Profiler and Tesla C1060: Where are the statistics???

Moiz_Ahmad · March 19, 2010, 7:51am

Most of the statistics are reported as 0 in the Visual Profiler when the Tesla C1060 is the device used.
WHY???

I need this information to determine why a $1200 Tesla C1060 card has the same performance as a $70 cheapie GT 220.
Unfortunately I can’t determine the bottleneck in performance.

Here are two screenshots comparing the Visual Profiler output for the Tesla and for the GT cards. Notice all the 0’s for the Tesla and all the non-zeros for the GT 220.

avidday · March 19, 2010, 8:07am

Probably because your kernel launch is too small. The profiler collects data by instrumenting a few of the multiprocessors on a device (usually between 1 and 3), and then scaling that sample up to approximate the whole GPU. So if you don’t launch enough blocks to cover every multiprocessor in a GPU, there is no guarantee that you will get reliable profiler statistics. The reason your GT 220 gives data while the C1060 doesn’t is that the GT 220 has only 6 MPs, whereas the C1060 has 30 MPs.

Moiz_Ahmad · March 19, 2010, 7:11pm

The block size is 16 x 16.
The grid size is 64 x 1024.

That should be more than enough to make the kernel launch “large”, right?

Moiz_Ahmad · March 19, 2010, 9:21pm

I don’t think it has to do with inactive MPs in my kernel. The Profiler reports zeros for my Tesla C1060 device, regardless of the kernel size.
Has anybody else seen this?

avidday · March 19, 2010, 9:37pm

Are you compiling the executable for compute 1.3 architecture?

Moiz_Ahmad · March 19, 2010, 10:57pm

I compiled and profiled for both the 1.2 and 1.3 architectures. They have different execution times.

Either way, statistics are “0” for the Tesla device.

Moiz_Ahmad · March 22, 2010, 7:50pm

How can I be sure that all MPs are being used??

avidday · March 22, 2010, 9:08pm

By launching at least M*N blocks, where M is the number of blocks per MP which will run (you can use the cubin and occupancy calculator spreadsheet or the formulas in the programming guide for this), and N is the number of multiprocessors, which is 30 for your C1060.
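As a rough sketch of that M*N estimate (not a replacement for the occupancy calculator): the helper names below are made up for illustration, and the bound only considers the per-multiprocessor thread and block caps. Register and shared memory usage, which the cubin reports, can push M lower, so treat the result as an upper bound on M.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor residency limits. The caller supplies the values;
// for compute capability 1.3 (e.g. the C1060) they are 1024 resident
// threads and 8 resident blocks per multiprocessor.
struct MpLimits {
    int max_threads_per_mp;
    int max_blocks_per_mp;
};

// Upper bound on M, the number of blocks that can be resident on one
// multiprocessor. ASSUMPTION: register and shared memory limits are
// ignored here, so the real M may be smaller.
int blocks_per_mp_upper_bound(const MpLimits& lim, int threads_per_block) {
    assert(threads_per_block > 0);
    return std::min(lim.max_threads_per_mp / threads_per_block,
                    lim.max_blocks_per_mp);
}

// Minimum number of blocks needed to give every multiprocessor a full
// set of resident blocks: M * N.
long long min_blocks_to_cover(int blocks_per_mp, int num_mps) {
    return static_cast<long long>(blocks_per_mp) * num_mps;
}
```

For the 16 x 16 block mentioned earlier in the thread (256 threads), this gives M of at most 4 on a compute 1.3 device, so M*N is at most 4 * 30 = 120 blocks on the C1060. A 64 x 1024 grid is 65536 blocks, which is far above that.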

Moiz_Ahmad · March 22, 2010, 11:23pm

The number of blocks I’m launching is much much greater than M*N. My question is:

How do I know the real-time scheduler/launcher is actually making use of all multi-processors?

P.S. thanks avidday. You are always a lifesaver.

avidday · March 23, 2010, 8:08am

You don’t. There isn’t presently anything exposed by the CUDA API that can show that level of detail (a GPU top-style utility has been much requested, but nothing has appeared thus far). I am sure your problem must be more prosaic than that, though. If your code isn’t collecting statistics, this usually means one of three things:

1. The code doesn’t contain enough work to reliably cover the instrumented CTA

2. The kernels are launching, but not running

3. A mismatch between driver and toolkit/profiler versions means statistics aren’t being collected

I think (1) can be ruled out, but what about (2) and (3)? What OS, toolkit and driver version are you using?

Moiz_Ahmad · March 23, 2010, 6:35pm

(1) I’m ruling this out for reasons mentioned already.

(2) The kernel output is correct. The kernel is definitely running. (Whether all MPs are utilized is another matter.)

(3) The GT 220 on the same system reports stats just fine. It uses the same toolkit and driver as the Tesla C1060.

Anyway, here are the details requested:

OS: Windows 7 Ultimate, 64-bit

Toolkit: CUDA 2.3

NVIDIA driver: 190.38

alandge · April 3, 2010, 7:07pm

Try using the latest toolkit (3.0) with the latest drivers. That may solve your problem.
