I’m trying to understand how the “cta launched” counter in the CUDA Visual Profiler works.
I wrote a small program with a block size of 16x16 threads.
My grid has a size of 10x10 (= 100 blocks).
According to the specifications my graphics card (GeForce 8600M GT) has 4 multiprocessors.
When I run the program in Profiler.app (v1.1.7), the “cta launched” column shows “50”.
I know that this counter only reflects the activity of one MP, but 100 blocks divided by 4 MPs is “25”, not “50”.
I’m using a 9500GT with 4 multiprocessors and I’m seeing the same thing as you.
Some tests I’ve run give the following results:
blocks 32 … “cta launched” is 16
blocks 16 … “cta launched” is 8
blocks 8 … “cta launched” is 4
blocks 2 … “cta launched” is 2
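One way to read these numbers is to ask how many multiprocessors the counter would have to cover to produce them. The snippet below is only a rough sketch: it assumes blocks are spread evenly over all 4 multiprocessors, which the profiler does not guarantee, and the function name `implied_mps_covered` is made up for this illustration.

```python
from fractions import Fraction

TOTAL_MPS = 4  # multiprocessors reported for the 9500GT in this post

# (blocks in the grid, "cta launched" value shown by the profiler)
observations = [(32, 16), (16, 8), (8, 4), (2, 2)]


def implied_mps_covered(blocks, cta_launched, total_mps=TOTAL_MPS):
    """Multiprocessors the counter would have to cover to explain the value.

    Assumption (not from the profiler docs): blocks are distributed
    evenly over all multiprocessors.
    """
    return Fraction(cta_launched * total_mps, blocks)


for blocks, launched in observations:
    covered = implied_mps_covered(blocks, launched)
    print(f"blocks={blocks:3d}  cta launched={launched:3d}  implied MPs covered={covered}")
```

Under that assumption the first three rows imply a counter covering 2 multiprocessors, while the 2-block run implies 4. With so few blocks the even-distribution assumption probably does not hold, so that last row should not be over-interpreted.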
I just found this thread, and then the answer. The hardware counters seem to be collected for a group of N multiprocessors, not for a single one. For me, on a GTX280, I think N = 3 (at least the results make sense). For the original poster, I think N = 2.
As far as I know, every N multiprocessors share one SM controller. CTAs are dispatched by that shared SM controller, so “cta launched” is collected for the whole group of N multiprocessors.
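The grouping idea can be turned into a quick estimate of what the counter should show. This is a sketch under stated assumptions, not the profiler's actual method: it assumes blocks are divided evenly among the groups and that the counter reports a single group. The function name `estimate_cta_launched` is hypothetical, and the figure of 30 multiprocessors for the GTX280 is my assumption, not something stated in this thread.

```python
def estimate_cta_launched(total_blocks, total_mps, mps_per_group):
    """Estimate the "cta launched" value for one group of multiprocessors.

    Assumptions (illustrative only): the device's multiprocessors are split
    into equal groups of `mps_per_group`, blocks are spread evenly across
    the groups, and the counter is collected for exactly one group.
    """
    if total_mps % mps_per_group != 0:
        raise ValueError("total_mps must be a multiple of mps_per_group")
    groups = total_mps // mps_per_group
    return total_blocks / groups


# First post: 10x10 grid = 100 blocks, 4 multiprocessors, N = 2.
print(estimate_cta_launched(100, 4, 2))

# GTX280 with N = 3, assuming 30 multiprocessors in total.
print(estimate_cta_launched(100, 30, 3))
```

For the first post's setup this gives 50, which matches the value reported by the profiler; with N = 1 the same call would give the 25 that was originally expected.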
/usr/local/cuda/computeprof/doc/Compute_Visual_Profiler_User_Guide.pdf even explains this abbreviation in Table 1, “NVIDIA® CUDA™ and OpenCL™ Terminology”.