I want to launch the maximum number of threads on my card…
As I understand it, the card has 4 multiprocessors,
but I can't figure out how blocks fit into this.
Can anyone give me a hint
or tell me where to look?
The programming guide has it all.
Well, the maximum grid size is 65535 x 65535 blocks and the maximum number of threads per block is 512, so you can execute approximately 2e12 threads in a single launch. Definitely only try this on the Linux console (no X) or with a second display in Windows, since it is bound to take a very long time even to launch an empty kernel.
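For anyone who wants to check the figure above, here is a quick back-of-the-envelope calculation. It is plain host-side arithmetic in Python with the two limits from this post hard-coded; the variable names are mine and nothing is queried from the device.

```python
# Launch limits quoted in the post above (hard-coded, not read from the GPU).
MAX_GRID_X = 65535
MAX_GRID_Y = 65535
MAX_THREADS_PER_BLOCK = 512

# Largest 2D grid, counted in blocks, and the thread count of one full launch.
max_blocks = MAX_GRID_X * MAX_GRID_Y
max_threads = max_blocks * MAX_THREADS_PER_BLOCK

print(f"blocks in the largest grid:    {max_blocks:,}")
print(f"threads in the largest launch: {max_threads:.2e}")
```

The second line prints roughly 2.2e12, which is where the "approximately 2e12" comes from.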
Anyway, as DenisR said: the programming guide has it all (and is very well written). Especially relevant is the part where it says that launching hundreds of blocks is needed to reach optimal performance.
Well, that is purely theoretical…
But in the occupancy calculator I see this
for G84:
Active Threads per Multiprocessor: 512
Active Thread Blocks per Multiprocessor: 1
Multiprocessors per GPU: 4
Doesn't it work out like this:
threads = multiprocessors per GPU * Active thread blocks * Active threads per Multiprocessor
threads = 4 * 1 * 512;
threads = 16384
Or am I wrong?
The GPU can have many more threads loaded than are actually executing at any given moment. It schedules time between the threads that are running and those that are waiting for execution, and context switches are very fast. So while only a certain number are actually running at one time, many more can be waiting to run. Generally you want to balance the time it takes to read from memory against the time that instructions take to process on the GPU, so that while one set of threads (a warp) is waiting to read or write to RAM, another set is running on data that has already been loaded.
Take a quick peek at section 3.2 of the programming guide.
Umm, 4 * 512 = 2048. But blocks of 512 threads are not the way to get the largest number of threads actively running on the device. Maximum occupancy is 24 warps per multiproc => threads = 24 * 32 * multiprocessors
In your case, with multiprocessors = 4, threads = 3072
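To make the difference between 2048 and 3072 concrete, here is a rough sketch in Python of how the block size changes the number of threads that can be resident at once. The 24-warp limit is the one quoted above. The limit of 8 blocks per multiprocessor is my own addition (it is the compute capability 1.x limit as I remember it from the programming guide), and register and shared memory usage are ignored completely, so treat the results as upper bounds. The function name is made up for this example.

```python
# Per-multiprocessor limits for a G8x-class GPU.
# MAX_WARPS_PER_MP comes from the post above; MAX_BLOCKS_PER_MP is an
# assumption added for this sketch. Registers and shared memory are ignored.
WARP_SIZE = 32
MAX_WARPS_PER_MP = 24
MAX_BLOCKS_PER_MP = 8


def resident_threads(threads_per_block, multiprocessors):
    """Upper bound on the threads resident on the whole GPU at once."""
    # A block always occupies a whole number of warps (ceiling division).
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    # Whole blocks that fit under both the warp limit and the block limit.
    blocks_per_mp = min(MAX_WARPS_PER_MP // warps_per_block, MAX_BLOCKS_PER_MP)
    return blocks_per_mp * threads_per_block * multiprocessors


for block_size in (512, 256, 128, 64):
    print(block_size, resident_threads(block_size, multiprocessors=4))
```

With 512-thread blocks only one block (16 warps) fits per multiproc, which gives the 2048 from the occupancy calculator. Blocks of 256 or 128 threads fill all 24 warps and reach 3072. Under the assumed 8-block limit, 64-thread blocks drop back to 2048.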
But running exactly this number of threads (or the number matching the occupancy of your kernel) isn't going to be the most efficient. Consider it a lower bound. To really get into the linear performance region of the card, you need 2-3 times this number of threads. The GPU is built to swap a new thread block in instantly when one completes. So if you only fill your GPU with exactly 3072 threads, any blocks that finish faster than the others will result in "wasted" GPU time as a multiproc sits partially idle.
As I said though: this calculation is still useful as a lower bound to get decent efficiency.
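As a rough illustration of the 2-3x rule of thumb, the sketch below turns it into a minimum block count. It is only arithmetic in Python: the 3072 figure is the one worked out above for 4 multiprocessors, while the block size of 256 and the helper name are hypothetical choices for the example.

```python
# Resident-thread figure from the discussion above: 24 warps * 32 threads * 4 multiprocs.
RESIDENT_THREADS = 24 * 32 * 4  # 3072


def min_blocks(threads_per_block, oversubscribe):
    """Blocks needed to launch `oversubscribe` times the resident thread count."""
    total_threads = oversubscribe * RESIDENT_THREADS
    # Ceiling division so a partial block still counts as one block.
    return -(-total_threads // threads_per_block)


for factor in (1, 2, 3):
    print(factor, min_blocks(threads_per_block=256, oversubscribe=factor))
```

For 256-thread blocks this gives 12 blocks just to fill the card, and 24 to 36 blocks for the 2-3x range. A real problem will usually want far more blocks than that, in line with the "hundreds of blocks" advice from the programming guide.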
Oops :D
Oh, I forgot to write that down in the formula.