What's the reason for the max. of 512 threads per block?

What’s the reason for the maximum of 512 threads per block, when an SM is capable of holding 768 active threads?
Was it a programming-model decision?

Thanks

Are you asking: why would anybody create a block with 512 threads, since that will limit the number of blocks on the multiprocessor to 1 and thus limit the number of threads I can have on the MP?
Maybe the max number of threads somebody should create in a block is 384 since that will allow 2 blocks and 768 threads to run on the MP.

I was curious what limits the max. number of threads per block from a hardware or architectural point of view…

Thanks for the post.

Possibly the limited size of the array that holds warp states in the hardware thread scheduler. At most 24 warps (768 threads, from up to eight thread blocks) can be queued for execution on any given SM.

Why they limited the block size to 512 threads instead of 768 then, I don’t know. Possibly because 512 is a handy dandy power of 2.

And then there’s the limited size of the register file (8192 for older chips, 16384 for newer chips). Either way, divide by 512 threads and that doesn’t give you that many registers to work with per thread.
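To put numbers on that, here is a quick sketch in plain C++ (it also compiles as CUDA host code). The name regs_per_thread is made up for illustration, and the sketch assumes a single block gets the whole register file and ignores any allocation granularity the hardware applies.

```cpp
// Hypothetical helper (not a CUDA API): registers available to each thread
// when one block of `threads_per_block` threads has the SM's whole register
// file to itself. Assumption: no allocation granularity, no other blocks.
constexpr int regs_per_thread(int register_file_size, int threads_per_block) {
    return register_file_size / threads_per_block;
}
```

With the figures above, regs_per_thread(8192, 512) is 16 and regs_per_thread(16384, 512) is 32. A hypothetical 1024-thread block on the newer chips would be back down to 16.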

Christian

This limitation of 512 threads per block can actually be a performance bottleneck. I have some algorithms that need all of shared memory, so it’s only one block using about 15K of shared. But I’m limited to 512 threads, which sucks, especially since I have registers left over on G200.
I’d love to have a full 1024 threads running in the block but this 512 threads-per-block is the limitation.

An example would be a ZIP compressor or decompressor. We need as large a shared memory cache as possible to make lookups into our last expanded text, so running multiple blocks is a bad idea. But we also do a lot of global memory reads and writes, both to write answers and to fetch plaintext that’s already been evicted from our shared memory cache… so for those global memory accesses, we want lots of threads… a full 1024 is great. But nope, you have to use 512, and therefore you become memory latency bottlenecked, even though you have registers and warps to spare.

I have exactly the same problem. I have a kernel where 1024 threads would be very natural, and when I first read about GT200 I was happy to see 1024 threads per block, but that turned into disappointment when I found out the max number of threads per block stayed at 512, so I am left with a very inelegant kernel. 512 is 2^9, so it is not like they ran out of bits in 1 byte. Please NVIDIA, give us 2^10 threads per block in 2.1 :)

Actually, when writing source code that’s compiled into transistors, an int can have any number of bits you want. The other thing that happens when compiling to hardware is that all the logic and control structures (loops, if statements, switch statements) get flattened, especially when you want the action to be performed in 1 cycle (or another deterministic amount). A function that takes a 9-bit integer can have 512 paths through it, while a function that takes a 10-bit integer can have 1024 paths, doubling its physical size. Not all functions balloon so exponentially: the complexity of most is indeed linear in relation to bit-width (and others are n*log(n), n^2, etc). But in essence the difference in transistors between 16 warps per block and 32 warps per block may have been significant.

Or at least… more significant than the performance gain from going from 512 threads to 1024. (Honestly, this should be pretty minor. Plus, I’ll bet $5 that if I take a look at your algorithm I could find a use for those faster-than-smem registers that’ll have you running 128 threads and getting two-fold better performance.)

It is a scan of an array of 2048 values. 512 threads lets you scan 1024 values, so I have written a dirty version that does 2048 values ;) I was not talking about performance gain, but about readability & maintainability :P

I think even 512 is overkill and not good to use in practice unless you have a lot of thread divergence within a block.

Note that longer thread blocks can often be strip-mined into shorter ones. This applies to reduction, for example.

What is the advantage of 512-thread blocks when there is thread divergence? In fact, is there a difference, I’ve wondered, between running one 512 thread block and multiple smaller blocks? (I sometimes get the impression that people think several blocks will be scheduled more efficiently than one block, but I’ve never heard a justification, and you seem to outright contradict that.)

You are talking about varying 2 parameters: thread block size and number of thread blocks. However, there is a third parameter — amount of work done per thread.

Keeping the work done per thread constant does not always work. In that case, when doing, say, a reduction, running more, shorter blocks yields more, shorter reductions, i.e. it solves a slightly different problem.

To handle this, instead of increasing the number of threads you increase the amount of work done per thread. Say, if you cut the thread block in half, you split every operation in two. “float a, b;” becomes “float a[2], b[2];”, “a+=b” becomes “a[0]+=b[0]; a[1]+=b[1];” and so on. This is where branches may pose difficulty.
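A rough host-side model of that transformation, in plain C++ (also valid as CUDA host code), in case it helps. Each loop iteration stands in for one GPU thread. The function names and the strided element layout are my own choices for illustration, not something taken from a real kernel.

```cpp
#include <vector>

// Original form: one (simulated) thread per element, each doing "a += b".
std::vector<float> add_one_per_thread(std::vector<float> a,
                                      const std::vector<float>& b) {
    const int num_threads = static_cast<int>(a.size());
    for (int tid = 0; tid < num_threads; ++tid) {
        a[tid] += b[tid];
    }
    return a;
}

// Strip-mined form: half as many threads, each holding a[2] and b[2].
// Assumes a.size() is even and a.size() == b.size().
std::vector<float> add_two_per_thread(std::vector<float> a,
                                      const std::vector<float>& b) {
    const int num_threads = static_cast<int>(a.size()) / 2;
    for (int tid = 0; tid < num_threads; ++tid) {
        // Per-thread "registers": element tid and element tid + num_threads.
        float ra[2] = { a[tid], a[tid + num_threads] };
        float rb[2] = { b[tid], b[tid + num_threads] };
        ra[0] += rb[0];
        ra[1] += rb[1];
        a[tid] = ra[0];
        a[tid + num_threads] = ra[1];
    }
    return a;
}
```

Both versions produce the same array. The second does it with half the threads and twice the registers per thread, which is the trade being described.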

Also, 512-thread blocks may indeed be less efficient on pre-GT200 GPUs compared to 256-thread blocks, as only 1 such block fits on a multiprocessor (occupancy 66%) versus 3 in the other case (occupancy 100%). I guess some kernels may be over-sensitive to occupancy and run slower.
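For anyone who wants to check those occupancy figures, here is a small sketch in plain C++ (also valid as CUDA host code). It only looks at the warp and block limits. The 24-warp limit is the one quoted in this thread; the limit of 8 resident blocks per SM is my assumption for these parts. Registers and shared memory are ignored, and the function names are made up.

```cpp
#include <algorithm>

// Pre-GT200 limits. kMaxWarpsPerSM comes from the 768-thread figure in this
// thread; kMaxBlocksPerSM = 8 is an assumption, not stated in the thread.
constexpr int kWarpSize = 32;
constexpr int kMaxWarpsPerSM = 24;
constexpr int kMaxBlocksPerSM = 8;

// Warps needed by one block, rounding a partial warp up to a whole one.
int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Blocks resident on one SM, limited only by warp slots and the block cap.
// Expects 1 <= threads_per_block <= 512.
int resident_blocks(int threads_per_block) {
    return std::min(kMaxWarpsPerSM / warps_per_block(threads_per_block),
                    kMaxBlocksPerSM);
}

// Occupancy as a percentage of the warp slots, rounded down.
int occupancy_percent(int threads_per_block) {
    const int resident_warps =
        resident_blocks(threads_per_block) * warps_per_block(threads_per_block);
    return 100 * resident_warps / kMaxWarpsPerSM;
}
```

Under these assumptions a 512-thread block gives 1 resident block and 66%, a 256-thread block gives 3 blocks and 100%, and 384 threads gives 2 blocks and 100%.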

Thread divergence is a problem within warps, not within blocks. Changing the number of threads in a block should thus not have any effect on divergence, unless of course the larger number of threads allows you to better divide the work. But unless you can find a good use for the extra threads, they will only waste registers and thus limit the number of concurrent blocks that can execute on an SM.

I wonder how the scheduler treats warps that belong to the same block versus those that belong to different blocks. Is there any distinction? (E.g., in case of a stall, is it quicker to schedule in another block’s warp or the same block’s warp?)

The architectural discrepancy between the max number of warps per block and the max number of warps in total suggests that the scheduler does process warps differently, depending on their grouping into blocks.

Also, 512 could simply be the largest power of two that fits in the max of 768 threads on G80.

If you strip-mine 8 diverging warps into 2, divergence within a warp becomes a problem.