CUDA Profiler: cta launched counter


I’m trying to understand how the “cta launched” counter in the CUDA Visual Profiler works.
I wrote a small program with a block size of 16x16 threads.
My grid has a size of 10x10 (= 100 blocks).
According to the specifications my graphics card (GeForce 8600M GT) has 4 multiprocessors.

When I run the profiler (v1.1.7), the “cta launched” column shows “50”.
I know that this counter only reflects the activity of one MP, but 100 blocks divided by 4 MPs is “25”, not “50”.

Any ideas?


I’m using a 9500GT device with 4 multiprocessors and I’m seeing the same thing as you.

Some tests I ran gave the following results:
blocks 32 … “cta launched” is 16
blocks 16 … “cta launched” is 8
blocks 8 … “cta launched” is 4
blocks 2 … “cta launched” is 2

I just found this thread, then the answer. The hardware counters seem to be collected for N multiprocessors. For me, on a GTX280, I think N = 3 (at least the results make sense). For the first author, I think it’s N=2.

Hope that helps.

As far as I know, every N multiprocessors share one SM controller. CTAs are dispatched by that shared SM controller, so “cta launched” is collected for a group of N multiprocessors rather than for a single one.

For G80, N=2, and N=3 for GTX200.
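
To check that explanation against the numbers posted above, here is a rough sketch in Python. It assumes the blocks are spread evenly over all multiprocessors and that the counter simply sums the blocks of one group of N multiprocessors; the function name and both assumptions are mine, not something taken from the profiler documentation.

```python
def expected_cta_launched(total_blocks, num_mps, mps_per_group):
    """Estimate the "cta launched" value for one group of multiprocessors.

    Assumption: blocks are distributed evenly over all MPs, and the
    counter covers mps_per_group of them (N in the posts above).
    """
    return total_blocks * mps_per_group / num_mps


# (total blocks, observed "cta launched") as reported in this thread,
# all on devices with 4 multiprocessors, using N = 2.
observations = [(100, 50), (32, 16), (16, 8), (8, 4), (2, 2)]

for blocks, observed in observations:
    estimate = expected_cta_launched(blocks, num_mps=4, mps_per_group=2)
    print(f"blocks={blocks:3d}  observed={observed:2d}  estimate={estimate:g}")
```

With N = 2 the estimate matches the 100, 32, 16 and 8 block cases. The 2 block case gives an estimate of 1 while the profiler reported 2, so for very small grids the even-distribution assumption apparently does not hold (both blocks may have landed in the monitored group).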



/usr/local/cuda/computeprof/doc/Compute_Visual_Profiler_User_Guide.pdf even explains this abbreviation in Table 1, “NVIDIA® CUDA™ and OpenCL™ Terminology”.