Block/threads and stuff...

Hi all,
I’m new to nVidia and have a couple of questions which I couldn’t find answers to in the CUDA docs.

  1. I understand that a block can have at most 512 threads. So why is the limit
    on a dim3 block 512 x 512 x 64?
  2. What mostly controls the performance of the GPU? Should I strive to have
    as many blocks as possible, or as many threads?
  3. Should kernels be as compact and simple as possible in order to gain the most performance?

thanks
eyal

You can have a block of (1,512,1) if it is convenient, or (4,4,32). The 512 x 512 x 64 figures are just the per-dimension maxima; the total number of threads per block (x * y * z) still cannot exceed 512. It is purely a matter of what is convenient for you in your code (no need to calculate indices in a cumbersome way).
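
For instance (a minimal sketch; the kernel, array name, and sizes are made up for illustration), a 2D problem is often easiest to write with a 2D block:

[code]
// Hypothetical example: scaling a width x height image with 16x16 blocks.
// 16 * 16 = 256 threads per block, well under the 512-thread limit;
// the 512 x 512 x 64 figures are only the per-dimension maxima.
__global__ void scale(float *img, int width, int height, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] *= factor;
}

// Launch: the block shape is chosen for indexing convenience, not speed.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scale<<<grid, block>>>(d_img, width, height, 0.5f);
[/code]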

That depends. If you use shared memory, using as many threads per block as possible can make more efficient use of it.

In general you can say: have each thread do as little work as possible, and have lots of threads (spread over a lot of blocks). A large number of blocks helps your code scale well on future GPUs.
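
As a hypothetical concrete illustration: an element-wise kernel gives each thread one output element, and the number of blocks simply grows with the problem size:

[code]
// Hypothetical sketch: one thread per element, many blocks for large n.
// Because the block count scales with n, the same code can fill future
// GPUs that have more multiprocessors.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];   // each thread does only a tiny amount of work
}

// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// add<<<blocks, threads>>>(d_a, d_b, d_c, n);
[/code]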

That really depends on your problem. Most of the time, kernels are bound by memory bandwidth. That means that if you can read and write less data by doing more calculations, you may end up with a faster kernel. But if your kernel starts to use a lot of registers, you get low occupancy, which may make it harder to hide latency.
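
A made-up illustration of that trade-off: instead of reading a precomputed weight array from global memory, recompute the weight on the fly and save one global read per element (the kernels and the weight formula here are invented for the example):

[code]
// Hypothetical sketch of trading bandwidth for arithmetic.
// Version A: two global reads per element (x and the precomputed weight w).
__global__ void apply_weights_read(const float *x, const float *w,
                                   float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] * w[i];
}

// Version B: one global read per element; the weight is recomputed,
// costing a few extra FLOPs but saving memory traffic.
__global__ void apply_weights_compute(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] * expf(-0.001f * i);
}
[/code]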

Unfortunately, I must say that after a year of doing CUDA work, I still find myself mis-predicting performance when I try to optimize things. Often trying things in different ways and benchmarking is the only way to go.

Hi,
Thanks a lot for the fast answer.
One more question please (at least for the time being :) )
If I create a grid of 65K x 65K blocks, each with 512 threads, that is a huge number
of threads. On a regular CPU there’s not much point in creating more threads
than the number of cores. How does the GPU manage such a large number of blocks
and threads?
Is it reasonable to create 65K x 65K blocks?
If I manage to logically divide the algorithm into only N x M blocks (N and M being
small numbers), does that probably mean the code will not use the GPU’s capabilities
and will probably not give a boost over the CPU?

Turned out to be more than one question :)

thanks in advance
eyal

Because the GPU has extremely specialized thread scheduling hardware that offers zero overhead to switch running threads on an MP and virtually zero overhead to launch a new block on an MP when an existing one finishes.

On the CPU, thread scheduling is done in [b]software[/b] (the OS) and every thread switch requires a painstaking and slow saving and restoring of the state of the processor (called the context switch).

Well, that depends on your application, of course. Consider this, though: if each block had 512 threads and each thread wrote 4 bytes (forget for the moment that you’d need more than 4 GiB of memory for this to work), the runtime of that kernel would be ~87.95s assuming you can sustain 100 GB/s of bandwidth (possible on GTX 280).
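
Spelling that estimate out (rough arithmetic, assuming the full 65,535 x 65,535 grid from your question): 65,535² ≈ 4.29 × 10⁹ blocks × 512 threads × 4 bytes ≈ 8.8 × 10¹² bytes, and 8.8 TB / 100 GB/s ≈ 88 s.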

It is all application dependent, of course. I can show you the speedup vs. the number of threads for my application and you can be the judge: the x-axis plots the number of threads in thousands, and the y-axis plots the speedup of HOOMD (http://www.ameslab.gov/hoomd) on the indicated GPU vs. a single core of an Opteron 285 CPU. This is running a polymer model in HOOMD, but I see similar speedups across the board.

Of course there will be exceptions, but the general rule is that optimal GPU speedups are obtained with tens of thousands of threads (since the GPU can keep that many threads running concurrently).
single_cpu_speedup_fig.pdf (18.8 KB)

Keep in mind that N x M must be at least the number of multiprocessors, as each block will be executed on exactly one of them. To keep scaling on future hardware you should have more blocks than multiprocessors. In addition, the maximum number of threads per MP is 768, and that cannot be reached with only one block per MP (a single block is limited to 512 threads).
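
If it helps, here is a small sketch (standard CUDA runtime API, nothing application-specific) for checking how many multiprocessors your device has, so you can judge whether an N x M grid will even fill them:

[code]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // Each block runs on exactly one multiprocessor, so a grid with fewer
    // blocks than this count leaves part of the GPU idle.
    printf("multiprocessors:       %d\n", prop.multiProcessorCount);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
[/code]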

Hi,
Again many thanks :)
From your experience, how much work/algorithm change does it take to move to the GPU?
I have an algorithm with 8 loops nested one inside the other (and if the GPU gives
us a ~10-20x boost, we’ll probably add a few more loops :) ).
I can have the two outer loops running in different threads (on the CPU), so the trivial GPU version was to use an N x M grid (N is loop 1 and M is loop 2). However, most of the iterations happen inside the inner loops, of course, and those I’m not sure I can parallelize (at least not easily).
Furthermore, the dataset I use can get up to 16 GB across those 8 loops, so the 16 KB of shared memory that the GPU offers per block is very, very little.
Any suggestions? Rules of thumb?

thanks again :)
eyal