CUDA profiler: what does cta_launched measure, and why?

My rather simple question is:
What exactly does cta_launched measure? I have to admit I obviously don’t fully understand what “Number of CTAs launched on the PM TPC” means, as it is described in the README.
Is it the number of thread blocks running on that very multiprocessor that is actually being profiled?

Rationale for my question:

I have the following grid layout:
dimGrid(32,128) (resulting from a model domain of 128x128x128 points)

So, a grid of 4096 blocks, each consisting of 512 threads.

Using a C870, which has 16 multiprocessors (MPs), I would expect 4096/16 = 256 blocks per MP. The CUDA profiler reports cta_launched = 511. So the blocks are not evenly distributed over the available MPs? In fact, if I run the program repeatedly, cta_launched even varies (508…518). Is this due to the non-optimal occupancy (0.67)?

Thanks for any enlightening comments.


Yes, a CTA (cooperative thread array) is a thread block, so cta_launched counts blocks…

Perhaps the C870 has fewer MPs than you think. Are you sure? Also, some of your blocks probably exit quickly, resulting in an unbalanced load among your MPs.
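
A quick sanity check on the numbers in the question, as a minimal sketch in host-side C++. It assumes, and this is not stated anywhere in the thread, that the counter is collected per TPC rather than per multiprocessor, and that the C870 groups its 16 multiprocessors into 8 TPCs of two each. The function names are made up for illustration. Under that assumption an even split gives 512 blocks per TPC, which is far closer to the reported 511 than the 256 expected per multiprocessor.

```cpp
#include <cstdio>

// Total number of blocks in a 2D grid of grid_x by grid_y blocks.
constexpr int total_blocks(int grid_x, int grid_y) {
    return grid_x * grid_y;
}

// Blocks each hardware unit would receive if the scheduler
// spread them perfectly evenly (integer division).
constexpr int blocks_per_unit(int blocks, int units) {
    return blocks / units;
}

// Prints the even-split expectation for both candidate counting units.
void print_expectations() {
    const int blocks     = total_blocks(32, 128); // dimGrid(32,128) from the question
    const int num_mp     = 16;                    // multiprocessors, as stated in the question
    const int mp_per_tpc = 2;                     // ASSUMPTION: not stated in the thread
    const int num_tpc    = num_mp / mp_per_tpc;   // 8 TPCs under that assumption

    std::printf("total blocks:   %d\n", blocks);                          // 4096
    std::printf("blocks per MP:  %d\n", blocks_per_unit(blocks, num_mp)); // 256
    std::printf("blocks per TPC: %d\n", blocks_per_unit(blocks, num_tpc)); // 512
}
```

Calling print_expectations() from main prints 4096, 256 and 512. If the per-TPC assumption holds, the run-to-run spread (508…518) would be small deviations around 512, consistent with blocks finishing at slightly different times, rather than a factor-of-two imbalance.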