Odd profiler results for Tesla C1060 (cta launched)

dhains · April 15, 2011, 7:59pm

When using computeprof from CUDA SDK 3.2, we are getting strange values for the cta launched counter. The test program is a simple vector add.

Total number of thread blocks | cta launched (from computeprof)
60 | 6
120 | 12
240 | 24

Since the profiler takes these measurements on a single multiprocessor, this suggests that one tenth of all thread blocks are being launched on that one multiprocessor. If thread blocks are distributed evenly across the multiprocessors in use, that would mean only 10 of the C1060's 30 multiprocessors are being used.
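To make the arithmetic explicit, here is a minimal sketch that compares the profiler numbers with what an even spread over all multiprocessors would give. The helper names (implied_sm_count, expected_blocks_per_sm) are made up for illustration, and the table values are hard-coded from the runs above; only cudaGetDeviceProperties and multiProcessorCount come from the CUDA runtime.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of SMs in use if every active SM received
// the same block count that the profiler reported for its one SM.
int implied_sm_count(int total_blocks, int cta_launched) {
    return total_blocks / cta_launched;
}

// Hypothetical helper: blocks per SM if the grid were spread evenly
// over every SM on the device.
double expected_blocks_per_sm(int total_blocks, int sm_count) {
    return static_cast<double>(total_blocks) / sm_count;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "could not query CUDA device 0\n");
        return 1;
    }
    const int sm_count = prop.multiProcessorCount;  // 30 on a Tesla C1060

    // (total thread blocks, cta launched) pairs copied from the table above.
    const int runs[3][2] = {{60, 6}, {120, 12}, {240, 24}};
    for (int r = 0; r < 3; ++r) {
        const int total = runs[r][0];
        const int launched = runs[r][1];
        printf("%3d blocks: profiler shows %2d on one SM (implies %d SMs); "
               "even spread over %d SMs would be %.1f per SM\n",
               total, launched, implied_sm_count(total, launched),
               sm_count, expected_blocks_per_sm(total, sm_count));
    }
    return 0;
}
```

In all three runs the implied count is 10, which is where the "10 of 30" figure comes from.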

Am I missing something here?

hyqneuron · April 16, 2011, 10:04am

You can use the methods mentioned in this thread to confirm how the blocks are actually being distributed: http://forums.nvidia.com/index.php?showtopic=186669

dhains · April 19, 2011, 8:09am

OK, thanks. It appears that all 30 SMs are being utilized as expected, with even load balancing. This must just be a profiler issue.
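
For anyone wanting to repeat this kind of check, below is a minimal sketch of one possible approach. It is not necessarily the method from the linked thread. It assumes inline PTX is available in your toolkit and reads the %smid special register once per block, then counts distinct SM IDs on the host. The kernel and helper names (vec_add_record_sm, get_smid, count_distinct_sms) and the launch configuration of 60 blocks of 256 threads are illustrative choices. Note that PTX describes %smid as a point-in-time value, so treat the result as a diagnostic rather than a guarantee.

```cuda
#include <cstdio>
#include <set>
#include <vector>
#include <cuda_runtime.h>

// Read the ID of the SM the calling thread is currently running on
// (PTX special register %smid, accessed through inline assembly).
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Vector add that also records, once per block, which SM ran that block.
__global__ void vec_add_record_sm(const float* a, const float* b, float* c,
                                  int n, unsigned int* block_sm) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
    if (threadIdx.x == 0) block_sm[blockIdx.x] = get_smid();
}

// Host helper: number of distinct SM IDs seen across all blocks.
size_t count_distinct_sms(const std::vector<unsigned int>& block_sm) {
    return std::set<unsigned int>(block_sm.begin(), block_sm.end()).size();
}

int main() {
    const int blocks = 60, threads = 256, n = blocks * threads;
    float *a, *b, *c;
    unsigned int* d_block_sm;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    cudaMalloc(&d_block_sm, blocks * sizeof(unsigned int));
    cudaMemset(a, 0, n * sizeof(float));  // input values do not matter here
    cudaMemset(b, 0, n * sizeof(float));

    vec_add_record_sm<<<blocks, threads>>>(a, b, c, n, d_block_sm);

    // The blocking copy also waits for the kernel to finish.
    std::vector<unsigned int> h_block_sm(blocks);
    cudaError_t err = cudaMemcpy(h_block_sm.data(), d_block_sm,
                                 blocks * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%d blocks ran on %zu distinct SMs\n",
           blocks, count_distinct_sms(h_block_sm));

    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(d_block_sm);
    return 0;
}
```

If the distinct-SM count matches the device's multiprocessor count, the blocks are reaching every SM regardless of what the cta launched counter suggests.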