Why can't I use all the multiprocessors?

Hi everybody,

I am using an 8600M GT GPU, which has 4 multiprocessors (SMs). When I run a kernel with many blocks, e.g. 100, the profiler shows that 50 blocks were launched on one SM, which would mean that only 2 SMs are used.

Why does this happen? Can anyone help me, please?

As mentioned in the profiler help, the profiler can only target one MP.
The four MPs on your GPU are grouped into 2 CTAs (1 cooperative thread array = 2 MPs).
So if you launch 100 blocks, 50 CTAs will be launched for the targeted MP.

N.

Thank you, I thought that the launched CTAs were for one MP.

This is the first time I have heard about CTAs spanning several MPs. Where in the CUDA documentation can I find more information about this topic?

So far I thought 1 thread block is called a CTA and is exclusively assigned to 1 MP.

Christian

Actually, my explanation was a bit ambiguous.
Take a look at:

http://www.hotchips.org/archives/hc19/2_Mon/HC19.02/HC19.02.02.pdf

On the fourth slide, you can see how the MPs are grouped in pairs, so if you target 1 MP, the CTAs for the other MP in its pair are counted too.
In the OP’s case you would otherwise expect a value of 25 (100 blocks divided by 4 MPs).

I’m not sure if this is the actual reason but that’s how I made sense of it. Sorry for the extra confusion :-)

N.

Check slide 9: it shows two different CTAs on the same group of SMs, so my problem probably still exists :)

It’s more a problem of unclear documentation from NVIDIA.
If you follow slide 9 and take that group as 1 MP, then your 8600M GPU (32 SMs) would only have 2 MPs, while the CUDA documentation says it has 4 MPs.
That way 8600M = 2 MPs, so if only one MP is targeted, 50 CTAs are launched for 100 blocks.

So I would just stick with 4 MPs (with 1 MP = 8 SMs) and assume that it has to launch twice as many CTAs for the targeted MP because they are grouped by 2.

N.

You may need to verify that you’ve actually got 4 MPs using the deviceQuery tool. My 8200 mGPU has only one MP, although the CUDA Programming Guide says that it has two.

I don’t think that MPs have to come in pairs. My 8200 mGPU has only one MP, and my GTX 260 has 27.

what is going on in this thread

SM = streaming multiprocessor, aka that thing that does actual computation.

CTA = block. Period, end of story. It runs on one SM.

TPC = thread/texture processing cluster (it depends on when you look at documentation and whether you’re looking at something that focuses on CUDA or on 3D graphics, I think). These are collections of SMs. On pre-GT200, you have 2 SMs per TPC. On GT200, it’s three.
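As a rough illustration of that grouping, here is a minimal sketch that derives a TPC count from an SM count. The helper name `tpc_count` is made up for this example, and rounding up for chips with fewer SMs than a full TPC is an assumption, not something stated in the thread.

```python
import math

def tpc_count(sm_count, sms_per_tpc):
    """Hypothetical helper: number of TPCs needed to hold sm_count SMs."""
    # Assumption: a partially filled TPC still counts as one TPC,
    # so the division is rounded up.
    return math.ceil(sm_count / sms_per_tpc)

# 8600M GT: 4 SMs, 2 SMs per TPC (pre-GT200)
print(tpc_count(4, 2))    # 2
# GT200 card with 27 SMs, 3 SMs per TPC
print(tpc_count(27, 3))   # 9
```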

Are you sure that particular counter is not per TPC? A number of profiler counters are per-TPC, so if this one is as well it would make perfect sense.

I think that’s the origin of the problem. The profiler help only states that it targets a single MP and (AFAIK) there’s no mention of per-TPC counters, so one would expect the number of CTAs launched in the profiler to be the total number of blocks divided by the number of MPs, and not by the number of TPCs, as appears to be the case.

N.

Huh, maybe it’s no longer per TPC? Let me check.

Why does the profiler introduce the ‘CTA’ when the rest of the documentation always uses blocks?

Because all of our internal documentation refers to CTAs and the term “block” is a CUDA invention…

(CTA = cooperative thread array, which probably makes a bit more sense than block when you know how they work, what the capabilities for sharing data are, etc.)

Yes, cta_launched is per TPC. Why this isn’t in the docs anymore I have no idea… (and that’s something I’ll find out)
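To put numbers on the original question, here is a minimal sketch of the arithmetic, assuming the blocks are spread evenly over the SMs (real scheduling need not be perfectly even). The function name `expected_cta_launched` and its parameters are hypothetical and not part of any profiler API.

```python
def expected_cta_launched(num_blocks, sm_count, sms_per_tpc):
    """Hypothetical helper: blocks per SM and per TPC under an even spread."""
    tpc_count = sm_count // sms_per_tpc   # assumes SMs fill whole TPCs
    per_sm = num_blocks / sm_count        # what a per-SM counter would show
    per_tpc = num_blocks / tpc_count      # what a per-TPC counter would show
    return per_sm, per_tpc

# 8600M GT: 100 blocks, 4 SMs, 2 SMs per TPC
per_sm, per_tpc = expected_cta_launched(num_blocks=100, sm_count=4, sms_per_tpc=2)
print(per_sm)    # 25.0
print(per_tpc)   # 50.0
```

The per-TPC figure of 50 matches what the profiler reported, which is consistent with all 4 SMs being busy rather than only 2.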