Here is my own answer from a few days ago.
(Some things have become clearer to me now, but some of this information might be useful to others, or to myself in the future, for some reason.)
All of my questions are pretty well explained in the
“CUDA C Programming Guide” (which came with CUDA Toolkit 4.0).
The model behind CUDA so far is:
Single Instruction Multiple Threads (SIMT).
Each thread has its own instruction pointer and its own registers.
A thread is part of a warp.
A warp consists of 32 threads.
Each thread can in principle logically follow its own path; this includes
paths/branches based on data and indexes, like the built-in thread/block indexes.
How exactly the hardware works remains secret, however I suspect the
hardware examines the instruction pointers of all threads.
If the hardware detects that all or most of the threads are
pointing to the same instruction, then I suspect that instruction gets
executed once if possible,
or perhaps it’s executed in parallel for different data items… but there is
probably some kind of efficiency going on deep in the hardware.
Perhaps it can execute c = a * b for all threads just once if c, a and b are
in a shared location; it makes no sense to calculate that 32 times, so perhaps
it can calculate that just once… the same might go for constant
calculations like c = 4 * 5.
So why exactly it’s called “single instruction multiple thread” is a bit
mysterious. Perhaps it’s simply a form of sharing “binary instruction
codes/space”, so that the same
program does not have to be copied hundreds of times, which makes sense.
Only the instruction pointers and the data items are unique and duplicated per thread.
Anyway, the hardware tries to execute the threads in a warp in parallel; if
it can’t do that for some reason, then some of the threads are serialized. It
might then still
execute some of those in parallel, or maybe not at all; that’s not
entirely clear, but I think it can do that.
After all the serialized divergent branches are executed, the threads come back
together at a “parallel continuation point” where all the threads are
executing in parallel again.
It also achieves this by stalling threads which have to wait on the other
threads within the warp. So it automatically serializes (de-parallelizes) and
synchronizes (re-parallelizes) the threads.
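The divergence/reconvergence idea above can be illustrated with a toy model. This is purely my own sketch of the concept (the real hardware mechanism is secret): each taken branch runs serially for the subset of lanes whose condition holds, and the warp reconverges afterwards.

```python
# Toy model of SIMT branch divergence within one 32-thread warp.
# NOTE: my own illustration of the idea, not how the hardware works.

WARP_SIZE = 32

def run_warp(data):
    """Each lane doubles even values and negates odd values."""
    results = [None] * WARP_SIZE
    mask_even = [x % 2 == 0 for x in data]

    # The "even" branch executes for the lanes where the mask holds.
    for lane in range(WARP_SIZE):
        if mask_even[lane]:
            results[lane] = data[lane] * 2

    # The "odd" branch executes next, serialized after the first one.
    for lane in range(WARP_SIZE):
        if not mask_even[lane]:
            results[lane] = -data[lane]

    # "Parallel continuation point": all lanes active again from here.
    return results

print(run_warp(list(range(WARP_SIZE)))[:4])  # -> [0, -1, 4, -3]
```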
When an entire warp gets stalled for some reason, it could switch to another
warp which might be ready to run and try to execute that.
Stalls can happen because of serializing, memory accesses, registers
being used in previous and current instructions (write/read dependencies),
and also synchronization instructions.
The weirdness for the programmer starts with trying to achieve maximum
performance in stall situations.
There seem to be two ways to get maximum performance:
Maximize the number of threads per thread block.
Maximize the number of thread blocks.
A thread block is not the same as a warp.
A thread block can consist of multiple warps.
So a thread block specifies how many threads there are in the block. There
is a certain maximum, say 512 or 1024, but it can also be just 2 or 10 or any number.
The warp scheduler/the CUDA system will divide the number of “threads in the
thread block” by the “warp size”.
The warp size is fixed at 32 for all GPUs.
So the number of “possible” warps for a thread block is:
“threads in thread block” / “warp size (32)” = number of warps.
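That division can be sketched like this (note that a partially filled last warp still occupies a whole warp, so the division has to round up):

```python
import math

WARP_SIZE = 32  # fixed for all GPUs

def warps_per_block(threads_per_block):
    # A partially filled last warp still counts as a full warp.
    return math.ceil(threads_per_block / WARP_SIZE)

print(warps_per_block(1024))  # -> 32
print(warps_per_block(100))   # -> 4 (3 full warps + 1 partial warp of 4 threads)
```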
So at first glance it would seem smart for the programmer to make the number
of threads in a thread block as high as possible, so that the number of
warps becomes as high as possible.
But there is a second, weird option:
The programmer could also choose to maximize the number of blocks.
To me this more or less seems like the same thing.
Suppose a thread block has 32 threads, which is exactly the warp size; then
multiple of these blocks could be made/issued.
As far as I can tell, the warp scheduler/GPU system will still switch from
stalled warps to other warps in other blocks?!
(Or perhaps not, I am not sure yet.)
^ This is a bit vague still…
So this seems a bit weird… why give the programmer two different
options/possibilities/setups which more or less do the same thing?
What’s the difference between:
- 1024 threads in a single block
- 1024/32 = 32 blocks of 32 threads each?
One possible answer might be that “blocks” can be distributed over multiple
“multi-processing units” (the basic piece of hardware/core inside a GPU),
while a single thread block can only reside on a single multi-processing unit.
So perhaps that’s the answer.
The programmer or program/application will have to examine the hardware; it
will have to look at how many “multi-processing units” are available.
And then it will have to distribute its workload over those “multi-processing
units”.
However the trick is to keep “stalled” multi-processing units busy as well.
So to be able to do that, a single thread block must have multiple warps, in
case a single warp gets stalled.
(It’s also wasteful if the number of threads is not a multiple of the
warp size, since then a few threads will do nothing.)
The programming guide has more information about how many cycles it takes to
execute a warp in its entirety.
It has to do with how long it takes to execute serialized
instructions. Each serialized instruction could take for example 4 clock
cycles; it depends a bit on the instruction’s
clock cycle cost.
Cheaper GPUs only execute one instruction per clock cycle per multi-processor
unit; expensive GPUs can execute two instructions or even more.
But this also makes latency hiding
more difficult: because it can execute more/faster, it must also have more
data/threads to switch to when latency happens. That is logical in a
way… more processing power means it can handle more data, as long as it also
has more memory of course.
When warps stall because of memory requests/latency, the claimed cost is
600 to 800 clock cycles.
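A rough back-of-the-envelope sketch of why that matters (the 4-cycles-per-instruction figure is just the example number from above, and the 10-independent-instructions figure is my own made-up assumption): to hide a memory stall, the scheduler needs enough other warps with independent work to switch to.

```python
import math

def warps_to_hide_latency(latency_cycles, cycles_per_instruction,
                          independent_instructions_per_warp):
    # While one warp waits on memory, each other warp can cover
    # (instructions * cycles) cycles of the stall before it stalls too.
    cover = cycles_per_instruction * independent_instructions_per_warp
    return math.ceil(latency_cycles / cover)

# 600-cycle memory latency, 4 cycles per instruction, each other warp
# having 10 independent instructions to chew on:
print(warps_to_hide_latency(600, 4, 10))  # -> 15 other warps needed
```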
The question is how to distribute. There seems to be no clear answer.
There is a “spreadsheet calculator” available, which is for Excel. This is
kind of crappy because I do not have Excel installed; I don’t really need MS
Office, but now I must install it just
for this one little tool. Anyway, the answer given is to experiment with
this/to experiment with the settings of the kernel execution.
In the future the answer might change a bit if GPUs can actually execute
multiple applications at the same time, like CPUs do/can… I am not yet
sure if GPUs can actually do that, and how it would impact the answer.
For now my advice would be simple:
Try to maximize the number of overall threads for the
application/kernel/algorithm/data first, so maximize its parallelism, and
put this number in a variable called TotalThreads.
Then query the hardware/GPU/device and figure out how many
“multi-processing units” it has, then simply distribute the total amount of
threads over the MPUs:
ThreadsPerMPU = TotalThreads div MPUCount;
TotalThreads = 2000;
MPUCount = 17; // unrealistic number, but it’s the idea that counts.
ThreadsPerMPU = 2000 div 17 = 117 threads per MPU.
However there is also a remainder, so that can be calculated as well:
ThreadsForLastMPU = 2000 mod 17 = 11 threads for a “last” MPU.
However there is no “last” MPU; there are only 17, not 18.
So the answer is probably to “ceil” the 117 up a little bit to 118, and
then recalculate how many threads there would be, to get an idea
of the number of wasted threads (ceiled threads * MPU count):
118 * 17 = 2006 threads, so 6 threads would be wasted. Not so bad in this
example. The maximum waste from this ceiling step is MPUCount - 1 = 16
threads; on top of that, rounding up to whole warps can waste up to
warp size - 1 = 31 threads per MPU.
However this assumption of waste only holds if the number of MPUs
multiplied by the warp size is actually lower than the total number of
threads; otherwise there will of course be idle threads anyway.
The minimum amount of threads needed to give every MPU one full warp is
MPU count * warp size = 17 * 32 = 544.
So if there were fewer than 544 threads in total, then some threads in
the warps would do nothing.
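The whole distribution calculation above, as a runnable sketch (MPUCount = 17 is the same made-up number as above):

```python
import math

WARP_SIZE = 32

def distribute(total_threads, mpu_count):
    # Round up so every MPU gets the same number of threads and no
    # "18th MPU" is needed for the remainder.
    threads_per_mpu = math.ceil(total_threads / mpu_count)
    wasted = threads_per_mpu * mpu_count - total_threads
    return threads_per_mpu, wasted

threads_per_mpu, wasted = distribute(2000, 17)
print(threads_per_mpu, wasted)  # -> 118 6

# Minimum number of threads needed to give every MPU one full warp:
print(17 * WARP_SIZE)  # -> 544
```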
However there is more to it than meets the eye, perhaps.
Let’s go back to the 117 or 118 threads per MPU.
This must now be divided by the warp size to get an idea of the last warp
not being a perfect 32:
117 / 32 = 3.656
117 mod 32 = 21, so the last warp would have only 21 active threads,
leaving 11 lanes idle.
I was not sure what CUDA does here: would it allocate 3 or 4 warps? It has
to be 4, since the remaining 21 threads still need a warp; a partial warp
occupies a whole warp.
Let’s see what happens for 118:
118 / 32 = 3.687
118 mod 32 = 22, so the last warp would have 22 active threads and 10 idle
lanes, but at least all 2000 threads fit.
At least there will be 3 full warps, which will offer some latency hiding, but
probably not a lot… unless the kernels execute a lot of instructions
before stalling again.
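The per-MPU warp bookkeeping, as a sketch matching the 117/118 numbers above:

```python
import math

WARP_SIZE = 32

def warp_layout(threads_per_mpu):
    # A partial warp still occupies a whole warp slot.
    warps = math.ceil(threads_per_mpu / WARP_SIZE)
    in_last_warp = threads_per_mpu % WARP_SIZE or WARP_SIZE
    idle_lanes = WARP_SIZE - in_last_warp
    return warps, in_last_warp, idle_lanes

print(warp_layout(117))  # -> (4, 21, 11): 4 warps, 11 idle lanes in the last
print(warp_layout(118))  # -> (4, 22, 10)
```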
So the conclusion is pretty simple/easy:
CUDA needs a lot of threads to perform well, or it needs a lot of
instructions per thread to perform well.
The more threads, the more memory it needs, so graphics cards will need to
get a lot more memory, but the memory system must also become faster, so more
bandwidth.
Generally it seems the more bandwidth, the hotter the card. Is there a
logical correlation/explanation?! Also, the higher the bandwidth, the higher
the price it seems, but this
could be artificial pricing.
The last question remains:
What are grids?! I am not exactly sure; perhaps this comes into play
when there are multiple devices or so… but as far as I can tell kernels
can only be executed
on a single device… so I kinda forgot what grids are for… they at least
offer more subdividing/distribution possibilities.
I think it’s coming back to me: a grid is the collection of all thread
blocks of one kernel launch.
Threads within a block can share (shared) memory with each other; separate blocks cannot.
So blocks are probably a further division for adding computation power, at the
cost of no sharing between the blocks themselves, except via the slower global
memory.
Shared memory is something special; I think it was somewhat faster than
global memory.
All the threads in a block share the “fast” memory. The “fast” memory consists of “registers”
and “shared memory”.
The more registers a thread/kernel needs, the fewer threads there can be.
The same applies to shared memory.
Perhaps threads also have a limited “local addressing space”.
This is kinda weird… the threads seem to be able to address only 16 KB of local
memory.
The local memory which is behind this local address space is actually also
as slow as global memory.
So I suspect this has something to do with some kind of thread-local address
space addressing limitation, which is really weird.
Perhaps local variables are first allocated to registers, and when that spills
over they end up in the local address space?! But why not spill directly to
global address space?!
That way it could have addressed a lot more… this is weird, but it probably
has to do with hardware cost savings or so… probably saving address lines.
Worst case for serialization is probably something like 24 cycles or so
(the guide mentioned something like that). It’s a bit different per GPU and
per kind of instruction, probably.
Overall my estimation would be that serialization is not much of a problem,
and distributing threads among thread blocks (MPUs) is OK (as described above).
(Instead of all threads in a single block (one MPU = bad for multiple-MPU
systems), or vice versa, a single thread for each block (1 thread per
MPU = generally bad for all GPUs).)
Threads are also capable of requesting their own memory addresses.
However there is a catch to this; multiple catches, actually.
For best performance the variables need to be of a fixed size: 4 bytes, 8
bytes or 16 bytes.
Furthermore they should be aligned on address boundaries like 4 bytes, 8
bytes, 16 bytes, 32 bytes, 64 bytes, 256 bytes, etc.
Consult the documentation for the exact alignment.
Furthermore, for maximum thread execution speed:
Thread N in a warp should use the data item located at: Memory[Alignment + N *
ElementSize].
This is where it is most similar to SSE… it then behaves like a vector, and
can probably do memory stuff in parallel.
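A sketch of that access pattern check. This is a simplified model of the idea only (the names and the segment-size rule are my own illustration; consult the guide for the real coalescing rules per compute capability):

```python
WARP_SIZE = 32

def warp_addresses(base, element_size):
    # Thread N of the warp reads Memory[base + N * element_size].
    return [base + n * element_size for n in range(WARP_SIZE)]

def is_coalesced(base, element_size):
    # The friendly pattern: base aligned to the segment size, and the
    # 32 accesses forming one contiguous, evenly strided run.
    addrs = warp_addresses(base, element_size)
    segment = WARP_SIZE * element_size
    return base % segment == 0 and addrs[-1] - addrs[0] == segment - element_size

print(is_coalesced(256, 4))  # -> True  (aligned, contiguous 4-byte items)
print(is_coalesced(100, 4))  # -> False (misaligned base address)
```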
Sometimes it can also do a special branching operation, which is called
predication.
This depends on how many threads want to execute different branches.
The weird thing was that if the number of threads wanting to execute
different branches was low, it will actually predicate; otherwise it would
serialize.
I wonder if this was perhaps a mistake in the manual… or perhaps it’s
because the predicated instructions are limited to a certain amount; if it
goes over that, it’s not possible, which could explain it. So below
something like 4 or 7 instructions it would predicate, otherwise it would serialize.
It is possible for the threads to request their own arbitrary memory
locations, fully random, but then it will probably not execute so fast.
Each of these requests is then treated as a separate memory transaction. So
this will greatly slow down the memory system.
The cheaper GPUs/hardware seem to have no caches (?); the expensive GPUs
seem to have some caches, so these caches might hide some of the bad memory
access patterns.
But for very large data sets these caches probably offer little performance
gain. Programmers should assume no caches are present/available when programming
for large data sets!
That sums it up pretty nicely concerning the way it works and performance.
The more expensive GPUs also have more compute capabilities (2.0), which
are special things like double precision performance, GPU-to-GPU transfers,
unified addressing (for easier programming: host pointer = device pointer), and
probably more advanced memory transfer functions, though I’m not sure if those
are 2.0-only or available on lower capabilities too.
Not all GPUs in the 5xx series are compute 2.0; as far as I can tell some
are 1.2 or 1.3. So this means: watch carefully which compute capability each
GPU has, it’s not the same across the entire GPU series.
Compute capabilities can be found on the NVIDIA website.
Hmm, first of all there are further problems with limiting the amount of
threads per block to the maximum-threads-per-block limitation.
Second of all, somebody wrote that the more threads per block are assigned,
the fewer registers are available per thread, which is logical, since they
need to share/distribute the available registers among each other.
So the “registers used/available” also becomes part of the “equation”.
So if a kernel uses lots of registers, then it might be smarter to reduce
the number of threads per block.
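That register trade-off can be sketched like this (the 32768-registers-per-MPU figure is the compute capability 2.0 number from the guide; treat the per-thread counts as made-up examples):

```python
def max_threads_by_registers(registers_per_mpu, registers_per_thread):
    # All resident threads on an MPU share one register file, so a
    # register-hungry kernel caps how many threads can be resident.
    return registers_per_mpu // registers_per_thread

# Example: 32768 registers per MPU (the compute 2.0 figure).
print(max_threads_by_registers(32768, 20))  # -> 1638 resident threads
print(max_threads_by_registers(32768, 63))  # -> 520, far fewer
```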
I wonder what the algorithm is for the spreadsheet calculator; perhaps it
uses a solver.