When profiling with computeprof from CUDA SDK 3.2, we are getting some strange results for the "cta launched" counter. The test program is a simple vector add.
Total number of thread blocks | cta launched (from computeprof)
60 | 6
120 | 12
240 | 24
Since the profiler takes these measurements on a single multiprocessor, this suggests that one tenth of the thread blocks are launched on that multiprocessor. If the thread blocks are distributed evenly across multiprocessors, that would mean only 10 of the 30 multiprocessors on the C1060 are being used.
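To make the arithmetic behind that reasoning explicit, here is a small sketch in Python. It uses only the numbers from the table above and rests on the same two assumptions as the post: that the counter covers exactly one multiprocessor, and that blocks are spread evenly. The helper name `implied_active_sms` is made up for illustration and is not part of any CUDA tool.

```python
# (total thread blocks, "cta launched" reported by computeprof), from the table above
measurements = [(60, 6), (120, 12), (240, 24)]

NUM_SMS = 30  # multiprocessors on the C1060, as stated in the post


def implied_active_sms(total_blocks, cta_launched):
    """Multiprocessors in use, ASSUMING the counter covers one multiprocessor
    and the blocks are distributed evenly across the active ones."""
    return total_blocks / cta_launched


for total, cta in measurements:
    # What one multiprocessor would see if all NUM_SMS shared the blocks evenly
    expected_per_sm = total / NUM_SMS
    implied = implied_active_sms(total, cta)
    print(
        f"{total} blocks: measured {cta}, "
        f"expected {expected_per_sm:g} with all {NUM_SMS} in use, "
        f"implied active multiprocessors = {implied:g}"
    )
```

In every row the measured count is three times what an even spread over all 30 multiprocessors would give, which is where the figure of 10 comes from.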
Am I missing something here?