Block/threads and stuff...

Hi all,
I’m new to nVidia and have a couple of questions that I couldn’t find answers to in the CUDA docs.

  1. I understand that a block can have at most 512 threads. So why is the limit
    on a dim3 block 512 x 512 x 64?
  2. What mostly controls the performance of the GPU? Should I strive to have
    as many blocks as possible, or as many threads?
  3. Should the kernel methods be as compact and simple as possible in order to gain as much performance as possible?


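You can have a block of (1,512,1) if it is convenient, or (2,4,64). The 512 x 512 x 64 figures are per-dimension maximums; the product of the three dimensions still may not exceed 512 threads. Which shape you pick is just a matter of what is convenient for you in your code (no need to calculate indices in a cumbersome way).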
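To make the two limits concrete, here is a small host-side sketch in Python (not CUDA code) that checks a block shape against both the per-dimension maximums and the 512-thread total quoted in this thread. The constant and function names are made up for illustration, and the limits are the ones for the hardware discussed here; newer GPUs may differ.

```python
from math import prod

# Limits quoted in this thread for the hardware of the time (assumption:
# they may be different on other GPUs).
MAX_THREADS_PER_BLOCK = 512
MAX_BLOCK_DIM = (512, 512, 64)  # per-dimension maximums for x, y, z


def is_valid_block(dim):
    """Return True if the (x, y, z) block shape respects both limits."""
    fits_each_axis = all(1 <= d <= limit for d, limit in zip(dim, MAX_BLOCK_DIM))
    return fits_each_axis and prod(dim) <= MAX_THREADS_PER_BLOCK


if __name__ == "__main__":
    for shape in [(1, 512, 1), (2, 4, 64), (4, 4, 64), (512, 512, 64)]:
        print(shape, "threads:", prod(shape), "valid:", is_valid_block(shape))
```

Note that a shape such as (4,4,64) fits every per-dimension maximum but multiplies out to 1024 threads, so it would be rejected on a 512-thread device.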

That depends. If you use shared memory, having as many threads per block as possible can make more efficient use of it.

In general, have each thread do as little work as possible and have lots of threads (spread over a lot of blocks). A large number of blocks helps your code scale well on future GPUs.
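As a rough illustration of "one small piece of work per thread", the following Python sketch mimics the usual index arithmetic on the host: how many blocks are needed to cover n elements, and which element a given thread would handle. The function names are hypothetical; in a real kernel the index would come from blockIdx.x * blockDim.x + threadIdx.x.

```python
def blocks_needed(n_elements, threads_per_block):
    """Number of blocks so that every element gets its own thread (rounded up)."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def global_index(block_idx, block_dim, thread_idx):
    """Element handled by a thread; mirrors blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


if __name__ == "__main__":
    n = 1_000_000
    tpb = 256
    n_blocks = blocks_needed(n, tpb)
    print("blocks:", n_blocks)
    # Threads in the last block whose index is >= n must simply do nothing.
    last = global_index(n_blocks - 1, tpb, tpb - 1)
    print("highest thread index:", last, "out of range:", last >= n)
```

Because the block count is rounded up, the last block usually contains a few threads past the end of the data, so a kernel written this way needs a bounds check.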

That really depends on your problem. Most of the time, kernels are bound by memory bandwidth. That means that if you can read and write less data by doing more calculations, you may end up with a faster kernel. But if your kernel starts to use a lot of registers, you get low occupancy, which may lead to more trouble hiding latency.
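The register/occupancy trade-off can be estimated on paper. The Python sketch below is a simplified model under stated assumptions: 8192 registers, 768 resident threads and 8 resident blocks per multiprocessor (figures for compute capability 1.0/1.1 parts), and no register allocation granularity. It is not a replacement for NVIDIA's occupancy calculator, and the function names are hypothetical.

```python
# Assumed per-multiprocessor limits (compute capability 1.0/1.1).
REGISTERS_PER_MP = 8192
MAX_THREADS_PER_MP = 768
MAX_BLOCKS_PER_MP = 8


def resident_threads(regs_per_thread, threads_per_block):
    """Threads that can be resident on one MP, limited by registers and thread count."""
    blocks_by_regs = REGISTERS_PER_MP // (regs_per_thread * threads_per_block)
    blocks_by_threads = MAX_THREADS_PER_MP // threads_per_block
    blocks = min(blocks_by_regs, blocks_by_threads, MAX_BLOCKS_PER_MP)
    return blocks * threads_per_block


def occupancy(regs_per_thread, threads_per_block):
    """Fraction of the MP's thread slots that are actually filled."""
    return resident_threads(regs_per_thread, threads_per_block) / MAX_THREADS_PER_MP


if __name__ == "__main__":
    for regs in (10, 16, 32):
        print(f"{regs} regs/thread, 256 threads/block: "
              f"occupancy {occupancy(regs, 256):.2f}")
```

In this model, going from 10 to 32 registers per thread at 256 threads per block cuts the resident threads from 768 to 256, which is the kind of drop that makes memory latency harder to hide.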

Unfortunately, I must say, that after 1 year doing CUDA stuff, I still find myself mis-predicting performance when I try to optimize things. Often trying things in different ways and benchmarking is the only way to go.

Thanks a lot for the fast answer.
One more question please (at least for the time being :) )
If I create a grid of 65K * 65K blocks, each with 512 threads, that’s a huge number
of threads. Now on a regular CPU there’s not much point in creating more threads
than the number of cores. How does the GPU manage such a number of blocks
and threads?
Is it reasonable to create 65K * 65K blocks?
If I manage to logically divide the algorithm into only N x M blocks (N and M being
small numbers), does that mean the code will not use the GPU’s capabilities
and probably won’t get a boost over the CPU?

Turned out to be more than one question :)

thanks in advance

Because the GPU has extremely specialized thread scheduling hardware that offers zero overhead to switch running threads on an MP and virtually zero overhead to launch a new block on an MP when an existing one finishes.

On the CPU, thread scheduling is done in software (the OS), and every thread switch requires a painstaking and slow saving and restoring of the state of the processor (called a context switch).

Well, that depends on your application, of course. Consider this, though: if you launched a 65K * 65K grid where each block had 512 threads and each thread wrote 4 bytes (forget for the moment that you’d need roughly 8 TiB of memory for this to work), the runtime of that kernel would be roughly 88 s, assuming you can sustain 100 GB/s of bandwidth (possible on GTX 280).
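The estimate above is plain arithmetic and can be reproduced with a few lines of Python. This sketch assumes "65K" means 65535 blocks per grid dimension and that 100 GB/s means 100e9 bytes per second; the function name is made up for illustration.

```python
def kernel_write_time(grid_x, grid_y, threads_per_block, bytes_per_thread,
                      bandwidth_bytes_per_s):
    """Return (total bytes written, seconds needed) for a bandwidth-bound kernel."""
    total_threads = grid_x * grid_y * threads_per_block
    total_bytes = total_threads * bytes_per_thread
    return total_bytes, total_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    total_bytes, seconds = kernel_write_time(
        grid_x=65535, grid_y=65535,    # assumption: "65K" = 65535 per dimension
        threads_per_block=512,
        bytes_per_thread=4,
        bandwidth_bytes_per_s=100e9,   # assumption: 100 GB/s = 100e9 bytes/s
    )
    print(f"bytes written: {total_bytes} ({total_bytes / 2**40:.2f} TiB)")
    print(f"estimated runtime: {seconds:.2f} s")
```

The same function can be used to sanity-check more realistic launch sizes before worrying about anything other than memory bandwidth.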

It is all application dependent, of course. I can show you the speedup results vs. the number of threads for my application and you can be the judge. The x-axis plots the number of threads in thousands and the y-axis plots the speedup observed for HOOMD (the speedup on the indicated GPU vs. a single core of an Opteron 285 CPU). This is running a polymer model in HOOMD, but I see similar speedups across the board.

Of course there will be exceptions, but the general rule is that optimal GPU speedups are obtained with tens of thousands of threads (since the GPU can keep this many threads running concurrently).
single_cpu_speedup_fig.pdf (18.8 KB)

Keep in mind that N x M must be at least the number of multiprocessors, since each block executes on exactly one MP. To keep scaling on future hardware you should have more blocks than multiprocessors. In addition, the maximum number of resident threads per MP is 768 (1024 on GT200-class GPUs such as the GTX 280), which a single block of at most 512 threads cannot fill, so you need more than one block per MP.
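A quick way to see why one block per MP is not enough is to count resident threads for a few block sizes. The Python sketch below assumes 768 resident threads and at most 8 resident blocks per MP, ignores register and shared memory limits, and uses 16 multiprocessors purely as an example value; the names are hypothetical.

```python
# Assumed per-multiprocessor limits; register and shared memory usage,
# which can lower these numbers further, are ignored here.
MAX_THREADS_PER_MP = 768
MAX_BLOCKS_PER_MP = 8


def blocks_per_mp(threads_per_block):
    """Blocks that fit on one MP when only the thread and block limits apply."""
    return min(MAX_BLOCKS_PER_MP, MAX_THREADS_PER_MP // threads_per_block)


def threads_to_fill_gpu(num_mps, threads_per_block):
    """Threads in flight when every MP holds as many blocks as it can."""
    return num_mps * blocks_per_mp(threads_per_block) * threads_per_block


if __name__ == "__main__":
    num_mps = 16  # example value only; query your own device for the real count
    for tpb in (512, 384, 256, 64):
        per_mp = blocks_per_mp(tpb) * tpb
        print(f"{tpb:3d} threads/block: {blocks_per_mp(tpb)} block(s)/MP, "
              f"{per_mp} threads/MP, {threads_to_fill_gpu(num_mps, tpb)} on the GPU")
```

With these assumptions a 512-thread block leaves a third of each MP's thread slots empty, while 384 or 256 threads per block fill it, and the whole-GPU totals land in the "tens of thousands of threads" range mentioned earlier in the thread.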

Again many thanks :)
From your experience, how much work and algorithm change does it take to move to the GPU?
I have an algorithm that has 8 loops, one nested inside the other (and if the GPU gives
us a ~10-20x boost, we’ll probably add a few more loops :) ).
I can have the two outer loops running in different threads (on the CPU), so the trivial GPU code was to have an N x M grid (N is loop 1 and M is loop 2). However, most of the iterations happen inside the inner loops, of course, and those I’m not sure I can parallelize (at least not easily).
Furthermore, the dataset I use can get up to 16GB across those 8 loops. The 16K of shared memory that the GPU offers per block is very small by comparison.
Any suggestions? Rules of thumb?

thanks again :)