What is going on in this thread?
SM = streaming multiprocessor, aka the thing that does the actual computation.
CTA = cooperative thread array, i.e. a block. Period, end of story. It runs on one SM.
TPC = thread/texture processing cluster (which name you see depends on when the documentation was written and whether it focuses on CUDA or on 3D graphics, I think). A TPC is a collection of SMs. On pre-GT200 parts, you have two SMs per TPC. On GT200, it’s three.
Are you sure that particular counter is not per TPC? A number of profiler counters are per-TPC, so if this one is as well it would make perfect sense.
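To make the per-TPC idea concrete, here is a rough back-of-the-envelope sketch in plain host C++ (no GPU needed). The function names are made up for illustration, and it assumes CTAs get spread evenly across TPCs, which the hardware does not promise. The point is only that a counter watching a single TPC should read roughly the total divided by the number of TPCs.

```cpp
#include <cassert>

// Hypothetical helper: number of TPCs on a chip, given its SM count and
// how many SMs sit in each TPC (2 on pre-GT200, 3 on GT200).
int tpc_count(int sm_count, int sms_per_tpc) {
    assert(sms_per_tpc > 0 && sm_count % sms_per_tpc == 0);
    return sm_count / sms_per_tpc;
}

// Hypothetical helper: what a counter that only observes ONE TPC would
// report for a launch of total_ctas blocks.
// ASSUMPTION: CTAs are distributed evenly across TPCs. Real scheduling
// may be uneven, so treat this as an estimate, not an exact prediction.
int expected_per_tpc_count(int total_ctas, int sm_count, int sms_per_tpc) {
    int tpcs = tpc_count(sm_count, sms_per_tpc);
    return total_ctas / tpcs;  // integer estimate; remainder is ignored
}
```

For example, a 30-SM GT200 part has 10 TPCs, so a 300-block launch would show up as about 30 on a per-TPC counter under that assumption. If the number you are seeing is off from what you expect by about the TPC count, that is a strong hint the counter is per TPC.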