Odd profiler results for Tesla C1060 (cta launched)

When using computeprof from CUDA SDK 3.2, we are getting some strange results for the cta launched counter on a Tesla C1060. The test program is a simple vector add.

Total number of thread blocks | cta launched (from computeprof)
60 | 6
120 | 12
240 | 24

Since the profiler takes these measurements on a single multiprocessor, this suggests that 1/10 of the total thread blocks (6 of 60, 12 of 120, 24 of 240) are being launched on that one multiprocessor. If thread blocks are distributed evenly across multiprocessors, that would mean only 10 of the 30 multiprocessors on the C1060 are being used.

Am I missing something here?

You can use the methods mentioned here to confirm this: http://forums.nvidia.com/index.php?showtopic=186669
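
For anyone who wants to run a similar check, here is a minimal sketch of one common approach, which may or may not be the method described in the linked thread: each block reads the PTX special register %smid and stores it, and the host then counts how many blocks landed on each SM. The kernel and helper names (vec_add_record_sm, read_smid, blocks_per_sm) are made up for illustration, and the block count and block size are assumptions, not the original poster's settings. It also assumes a toolkit that accepts inline PTX. Note that the PTX documentation describes %smid as volatile and the numbering as not guaranteed to be contiguous, so treat the result as a rough diagnostic.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error: %s (line %d)\n",          \
                    cudaGetErrorString(err_), __LINE__);           \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Read %smid, the ID of the multiprocessor executing the calling thread.
__device__ unsigned int read_smid()
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Vector add that also records, once per block, the SM the block ran on.
__global__ void vec_add_record_sm(const float *a, const float *b, float *c,
                                  int n, unsigned int *block_smid)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
    if (threadIdx.x == 0)
        block_smid[blockIdx.x] = read_smid();
}

// Host helper: map each observed SM ID to the number of blocks it executed.
std::map<unsigned int, int> blocks_per_sm(const unsigned int *smids, int num_blocks)
{
    std::map<unsigned int, int> counts;
    for (int b = 0; b < num_blocks; ++b)
        ++counts[smids[b]];
    return counts;
}

int main()
{
    const int num_blocks = 60;   // assumed; matches the first row of the table
    const int threads = 256;     // assumed block size
    const int n = num_blocks * threads;

    std::vector<float> h_a(n, 1.0f), h_b(n, 2.0f), h_c(n, 0.0f);
    std::vector<unsigned int> h_smid(num_blocks, 0);

    float *d_a = 0, *d_b = 0, *d_c = 0;
    unsigned int *d_smid = 0;
    CUDA_CHECK(cudaMalloc((void **)&d_a, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc((void **)&d_b, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc((void **)&d_c, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc((void **)&d_smid, num_blocks * sizeof(unsigned int)));
    CUDA_CHECK(cudaMemcpy(d_a, &h_a[0], n * sizeof(float), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, &h_b[0], n * sizeof(float), cudaMemcpyHostToDevice));

    vec_add_record_sm<<<num_blocks, threads>>>(d_a, d_b, d_c, n, d_smid);
    CUDA_CHECK(cudaGetLastError());

    // These blocking copies also wait for the kernel to finish.
    CUDA_CHECK(cudaMemcpy(&h_c[0], d_c, n * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaMemcpy(&h_smid[0], d_smid, num_blocks * sizeof(unsigned int),
                          cudaMemcpyDeviceToHost));

    // Sanity-check the vector add itself (1.0f + 2.0f is exactly 3.0f).
    for (int i = 0; i < n; ++i)
        assert(h_c[i] == 3.0f);

    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));

    std::map<unsigned int, int> counts = blocks_per_sm(&h_smid[0], num_blocks);
    for (std::map<unsigned int, int>::const_iterator it = counts.begin();
         it != counts.end(); ++it)
        printf("SM %u: %d block(s)\n", it->first, it->second);
    printf("%d distinct SM IDs observed; device reports %d multiprocessors\n",
           (int)counts.size(), prop.multiProcessorCount);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); cudaFree(d_smid);
    return 0;
}
```

If the number of distinct SM IDs matches the multiprocessor count and the per-SM block counts are roughly equal, the blocks are being spread across the whole device regardless of what the profiler counter shows.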

OK, thanks. It appears that all 30 SMs are being utilized as expected, with even load balancing, so this must just be a profiler issue.