I am using an 8600M GT GPU, which has 4 multiprocessors (SMs), but when I run a kernel with many blocks, e.g. 100, the profiler shows that 50 blocks were launched on one SM, which would mean that only 2 SMs are used.
As mentioned in the profiler help, the profiler can only target one MP.
The four MPs on your GPU are grouped into 2 CTAs (1 cooperative thread array = 2 MPs).
So if you launch 100 blocks, 50 CTAs will be launched for the targeted MP.
On the fourth slide, you can see how the MPs are grouped in pairs, so if you target 1 MP, the CTAs for the other MP in its pair are counted too.
In the OP’s case you would otherwise expect a value of 25 (100 blocks divided by 4 MPs).
I’m not sure if this is the actual reason but that’s how I made sense of it. Sorry for the extra confusion :-)
It’s more a problem of unclear documentation from NVIDIA.
If you follow slide 9 and take that as 1 MP, then your 8600M GPU (32 SPs) would only have 2 MPs, while the CUDA documentation says it has 4 MPs.
That way 8600M = 2 MPs, so if only one MP is targeted, 50 CTAs are launched for 100 blocks.
So I would just stick with 4 MPs (1 MP = 8 SPs) and assume that the profiler reports twice as many CTAs for the targeted MP because the MPs are grouped in pairs.
You may need to verify that you’ve actually got 4 MPs using the deviceQuery tool. My 8200 mGPU has only one MP, although the CUDA Programming Guide says that it has two.
I don’t think that MPs have to come in pairs. My 8200 mGPU has only one MP, and my GTX 260 has 27.
SM = streaming multiprocessor, aka the thing that does the actual computation.
CTA = block. Period, end of story. It runs on one SM.
TPC = thread/texture processing cluster (it depends on when you look at documentation and whether you’re looking at something that focuses on CUDA or on 3D graphics, I think). These are collections of SMs. On pre-GT200, you have 2 SMs per TPC. On GT200, it’s three.
Are you sure that particular counter is not per TPC? A number of profiler counters are per-TPC, so if this one is as well it would make perfect sense.
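To make the arithmetic concrete, here is a minimal sketch of what the counter would read if it were per SM versus per TPC. The helper name `expected_cta_counter` is made up for illustration, and the sketch assumes blocks are spread evenly over all SMs, which the hardware scheduler does not guarantee. The checks run at compile time, so the snippet has no `main`.

```cpp
// Hypothetical helper (not part of any CUDA API): number of blocks (CTAs)
// seen by a counter that covers `sms_per_unit` SMs.
// Assumption: blocks are distributed evenly across all SMs on the device.
constexpr int expected_cta_counter(int total_blocks, int total_sms, int sms_per_unit) {
    return total_blocks * sms_per_unit / total_sms;
}

// 8600M GT from the question: 4 SMs, 2 SMs per TPC (pre-GT200), 100 blocks.
static_assert(expected_cta_counter(100, 4, 1) == 25, "counter per SM");
static_assert(expected_cta_counter(100, 4, 2) == 50, "counter per TPC");
```

With 2 SMs per TPC, a per-TPC counter gives the 50 that the OP saw, while a per-SM counter would give 25.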
I think that’s the origin of the problem. The profiler help only states that it targets a single MP and (AFAIK) there’s no mention of per-TPC counters, so one would expect the number of CTAs launched in the profiler to be the total number of blocks divided by the number of MPs, and not by the number of TPCs as appears to be the case.
Because all of our internal documentation refers to CTAs and the term “block” is a CUDA invention…
(CTA = cooperative thread array, which probably makes a bit more sense than block when you know how they work, what the capabilities for sharing data are, etc.)