Grids and number of processors: how are they related?

As we know, “<<< Dg, Db, Ns >>>” is required when calling a kernel to execute on the GPU (the device), where Dg is the grid size (number of blocks), Db is the block size (number of threads per block), and the optional Ns is the number of bytes of shared memory to allocate dynamically per block.

However, I would like to know how the number of processors and the grid are related.

For example, I am using a Tesla C870, which has 16 multiprocessors with 8 processors each, for a total of 128 processors. I want to scale my program by testing it on 16, 32, 48, and so on up to 128 processors. How can I achieve this with CUDA programming?

(I thought there must be some relation between the grid and/or block size used in the program and the number of processors on the GPU card.)

Kindly let me know.

With Regards,
Satakarni

The effect of the hardware MP count is mostly hidden from you… you don’t really have to analyze the way the device maps your blocks to multiprocessors. You CAN if you want to, by querying the device properties, but you don’t have much control and in fact you shouldn’t try to play too many games with it.

The general way to allow scaling is to make sure you have lots of blocks in your grid. The overhead of a block launch is small (but admittedly non-negligible), and more blocks give the device better granularity for running them in parallel.

So you may have 16 multiprocessors, and think that using 16 blocks is best… but maybe not! If you used 32 blocks, perhaps (depending on your kernel’s resource use) two blocks could run on each multiprocessor simultaneously, giving a speed boost. And if you do use 16 blocks, and 10 of them finish while 6 are still cooking, 10 multiprocessors will sit idle, waiting for the remaining 6.

If you have 400 blocks, this is all solved for you: you get finer-grained block scheduling. The blocks just get queued up and every MP stays busy until the very end.

Using more blocks is also robust to device changes. If you have 16 blocks, it makes a big difference whether your device has 12 MPs or 16: it could easily run at half speed on the 12 MP device, because 4 blocks have to wait for a free MP. But if you have 400 blocks, your idle MP overhead is negligible and you’ll be using both devices to their full advantage. And even when tomorrow Nvidia releases their new 72 MP monsterboard, your high-block project will be ready to run efficiently.
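To put rough numbers on the 16-versus-400 comparison, here is a minimal sketch in plain host-side C++ (no GPU needed). It is a simplified model, not a description of how the hardware scheduler really works: it assumes every block takes the same time and that blocks are issued in whole "waves". The function names `waves` and `utilization` and the `blocksPerMP` parameter are made up for this illustration.

```cpp
#include <cassert>

// Simplified model (an assumption, not documented scheduler behavior):
// every block takes the same time, and each MP runs `blocksPerMP`
// blocks at once.

// Number of scheduling "waves" needed to drain the whole grid.
int waves(int blocks, int mps, int blocksPerMP = 1) {
    assert(blocks > 0 && mps > 0 && blocksPerMP > 0);
    int slots = mps * blocksPerMP;        // blocks that can run at the same time
    return (blocks + slots - 1) / slots;  // ceiling division
}

// Fraction of block slots doing useful work, averaged over all waves.
double utilization(int blocks, int mps, int blocksPerMP = 1) {
    int slots = mps * blocksPerMP;
    return static_cast<double>(blocks) /
           (static_cast<double>(waves(blocks, mps, blocksPerMP)) * slots);
}
```

Under this model, 16 blocks need 1 wave on 16 MPs but 2 waves on 12 MPs (utilization 16/24, about 67%), which is the "half speed" case. 400 blocks need 25 waves on 16 MPs and 34 waves on 12 MPs, with utilization of 100% and about 98% respectively.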

Now, you did ask another question: how can you TEST your program running on different numbers of MPs? The answer is that you should buy a variety of hardware. :-) I’m sure the GPU firmware could be modified to use only a subset of your MPs, but that’s not something we have easy access to.

Thank you, sir. It is a wonderful explanation. So kind of you.

Sir, now let me put my understanding of your post this way. Please correct me if I am wrong.

(I will mainly be using a regular matrix multiplication program to analyze the scalability of CUDA on the GPU in comparison with MPI on the CPU.)

Each group of blocks (i.e., total number of blocks / total number of MPs) will run simultaneously on the processors of each MP. In simple words, block 0 runs on MP 0, and so on up to block 15 on MP 15, if we launch 16 blocks on a device with 16 MPs (ideally, at least).

Is that correct? (My understanding of this concept is very important for my experiments.)

Close. Ideally, you can run up to 8 blocks concurrently on each MP, or more meaningfully 24 warps. So the hardware is capable of handling 16 MPs * 24 warps/MP * 32 threads/warp = 12,288 concurrent threads.
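The arithmetic above can be written out as a small host-side C++ sketch. The three limits are the ones quoted in this thread for this hardware generation; other GPUs differ. The names (`residentThreads`, `blocksPerMP`, and the constants) are made up for this illustration, and the sketch deliberately ignores register and shared memory usage, which can lower the number of resident blocks further.

```cpp
#include <cassert>

// Limits quoted in this thread for this hardware generation (other GPUs differ).
constexpr int kWarpSize       = 32;  // threads per warp
constexpr int kMaxWarpsPerMP  = 24;  // resident warps per multiprocessor
constexpr int kMaxBlocksPerMP = 8;   // resident blocks per multiprocessor

// Threads that can be resident on one MP at the same time.
constexpr int residentThreadsPerMP() { return kMaxWarpsPerMP * kWarpSize; }

// Threads that can be resident on a device with `mps` multiprocessors.
constexpr int residentThreads(int mps) { return mps * residentThreadsPerMP(); }

// Blocks of a given size that fit on one MP, considering ONLY the warp and
// block limits (registers and shared memory are ignored in this sketch).
int blocksPerMP(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    int byWarps = kMaxWarpsPerMP / warpsPerBlock;
    return byWarps < kMaxBlocksPerMP ? byWarps : kMaxBlocksPerMP;
}
```

For example, `residentThreads(16)` gives 12288. With 256 threads per block (8 warps), `blocksPerMP(256)` gives 3, while with 64 threads per block the warp limit would allow 12 blocks but the block limit caps it at 8.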

If you run too few threads, you will notice a “stair-step” look to the execution time graph, since adding more threads requires no extra time. Once you get over ~12,000 threads, this turns into a linear performance graph. There is no overhead for switching threads, so adding more just linearly increases the execution time (assuming all threads perform a similar amount of work).

So it seems that at most 768 threads can run concurrently on a multiprocessor, according to the CUDA technical specifications. But in fact, a multiprocessor is composed of 8 processors (maybe not on all GPUs, but for example on the Tesla C870). And physically, I suppose only 1 thread is executed “simultaneously” on 1 processor… unless my understanding of the architecture is wrong and 1 processor contains several SIMD units (but I don’t think so!).
So could you explain what the word “concurrently” means here?

“Concurrently” means that all those threads are resident in the registers of the MP and active in the scheduler. The MP does not run a single warp to completion at a time. It interleaves warps to avoid register read-after-write hazards and to do useful work in some threads while others are waiting for their global memory reads.

So, if each line is a clock cycle (actually 4, since a warp is processed in 4 clocks), the execution stream on a single MP might look like this:

Warp 1 requests global memory read -> enters wait state in scheduler
Warp 2 does arithmetic on a reg
Warp 3 does arithmetic on a reg
Warp 18 does arithmetic on a reg
Warp 12 does arithmetic on a reg
Warp 2 does arithmetic on a reg
Warp 1’s global memory read is ready, copy value to register
Warp 16 does arithmetic on a reg
Warp 20 does arithmetic on a reg
… and so on
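The interleaving in the listing above can be mimicked with a toy simulation in plain host-side C++. Everything here is an illustrative assumption, not real hardware behavior: each warp repeats "one load, then some arithmetic that needs the loaded value", a load takes `latency` issue slots to come back, and the scheduler simply picks the next ready warp in round-robin order. Register read-after-write hazards are not modeled. The names `simulate` and `Result` are made up for this sketch.

```cpp
#include <cassert>
#include <vector>

struct Result { int totalSlots; int idleSlots; };

// Toy model of one multiprocessor; all parameters are illustrative assumptions.
// Each warp repeats `iterations` times: 1 load instruction, then `arithPerLoad`
// arithmetic instructions that need the loaded value. A load occupies one issue
// slot and its data is usable `latency` slots after that.
Result simulate(int warps, int iterations, int arithPerLoad, int latency) {
    assert(warps > 0 && iterations > 0 && arithPerLoad >= 0 && latency >= 0);
    struct Warp { int readyAt = 0; int remaining = 0; int pos = 0; };
    const int perIter = 1 + arithPerLoad;  // instructions per iteration
    std::vector<Warp> w(warps);
    for (auto& x : w) x.remaining = iterations * perIter;

    int unfinished = warps, slot = 0, idle = 0, next = 0;
    while (unfinished > 0) {
        bool issued = false;
        // Round-robin: start from the warp after the one that issued last.
        for (int k = 0; k < warps && !issued; ++k) {
            Warp& x = w[(next + k) % warps];
            if (x.remaining == 0 || x.readyAt > slot) continue;  // done or waiting
            bool isLoad = (x.pos % perIter == 0);
            x.readyAt = slot + 1 + (isLoad ? latency : 0);  // wait out the load
            ++x.pos;
            if (--x.remaining == 0) --unfinished;
            next = (next + k + 1) % warps;
            issued = true;
        }
        if (!issued) ++idle;  // no warp was ready: the MP wastes this slot
        ++slot;
    }
    return {slot, idle};
}
```

With a single warp, one load, one arithmetic instruction and a latency of 10 slots, `simulate(1, 1, 1, 10)` uses 12 slots, 10 of them idle. With 11 such warps, `simulate(11, 1, 1, 10)` issues all 22 instructions in 22 slots with no idle slot, because other warps fill the time each load is in flight.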

Thx,

my understanding was not so bad finally…

A last question about grids.

Grids seem to exist only because of the physical difficulty (and the cost) of extending the shared memory (and its schedulers) beyond the maximum number of blocks usable per grid, don’t they?

Do you know why it is impossible to run different kernels concurrently in different grids?