number of threads and registers

I am a little confused about the numbers.

In the programming guide, appendix A.1 general specifications:

  • The maximum number of threads per block is 512
  • The maximum number of active threads per multiprocessor is 768

If I am not wrong, a block is executed on a single multiprocessor. So why are these numbers different? I guess it is because I can run, for example, two blocks, one with 512 threads and the other with 256, right?

What happens if a block contains more than 512 threads, e.g. a 64x64 block? Are the threads scheduled to run serially?

Also, it says:

  • The number of registers per multiprocessor is 8192

Can any processor access all the registers? Or does each processor get 8192/8 = 1024 registers?

The maximum number of threads per block is 512. You can't launch a configuration with more than that many threads per block.
However, you can launch your kernel with fewer threads than that.
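A minimal sketch of what that limit looks like in practice (the kernel name and body are hypothetical; the launch that exceeds the limit fails with a configuration error rather than being serialized):

```cuda
#include <cstdio>

// Hypothetical kernel, just to have something to launch.
__global__ void myKernel(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (float)i;
}

int main() {
    float *d_out;
    cudaMalloc(&d_out, 1024 * sizeof(float));

    // 2 blocks of 512 threads: within the per-block limit, so this launches.
    myKernel<<<2, 512>>>(d_out);
    printf("512 threads/block: %s\n", cudaGetErrorString(cudaGetLastError()));

    // dim3(64, 64) asks for 4096 threads in one block: the launch is rejected
    // (invalid configuration argument), it is NOT split up and run serially.
    myKernel<<<1, dim3(64, 64)>>>(d_out);
    printf("64x64 block: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_out);
    return 0;
}
```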

To my knowledge, 512 is a hard limit on the total thread count per block and is NOT a limit on any particular dimension. So 64x64 (4096 threads) canNOT be possible. Better check with someone more knowledgeable, or read the manual again for clarity.

The basic unit of parallel execution is a “warp”, and there are 24 warps per multiprocessor: 24*32 = 768 threads. That is where the limit for a multiprocessor comes from. Remember, we are talking only about “active” threads, i.e. threads which are scheduled and running.

You canNOT run one block with 512 threads and another with 256 threads in the same launch. The number of threads per block is a CONSTANT for a kernel launch and is available to programmers via the “blockDim” variable.

The registers of a multi-processor are partitioned among threads. The number of registers also forms a limiting factor on the number of active threads of execution on a multi-processor.
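As a rough sketch of that trade-off, using the 8192-register figure quoted above (note this is only an estimate: real hardware allocates registers with some per-warp/per-block granularity):

```cuda
#include <cstdio>

// Back-of-envelope: how register use caps the number of active threads
// on one multiprocessor, given the G80-era numbers quoted in this thread.
int main() {
    const int regsPerMP = 8192;  // registers per multiprocessor
    const int maxActive = 768;   // hardware cap: 24 warps * 32 threads

    for (int regsPerThread = 5; regsPerThread <= 25; regsPerThread += 5) {
        int byRegs = regsPerMP / regsPerThread;          // register-limited
        int active = byRegs < maxActive ? byRegs : maxActive;
        printf("%2d regs/thread -> at most %3d active threads\n",
               regsPerThread, active);
    }
    return 0;
}
```

At 10 registers per thread or fewer, the 768-thread cap is the binding limit; beyond that, registers start to cut into the number of active threads.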

OK, I understand that 512 threads per block is a hard limit. So 64x64 is not possible (it raises a runtime error).

I guess CTA = block, right?

Suppose I have grid(2) and block(512). Each block will then be mapped to a different multiprocessor, since 512*2 > 768 and a block must stay on a single multiprocessor (because of shared memory and synchronization), right? So, if there is no problem with synchronization etc., it is better to have grid(4) and block(256), since that utilizes more resources? I am still wondering why it is designed such that a single block cannot utilize all of the resources of a multiprocessor.

About registers: do I read your answer correctly that when I have a single thread, it gets all 8192 registers?

It’s best to have at least 100 blocks. The device will run more than one block per multiprocessor and take advantage of memory/computation interleaving. The optimal block size depends on your application and should be determined by benchmarking.
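A minimal benchmarking sketch along those lines (the kernel is hypothetical; the point is to time the same kernel at a few block sizes and pick the fastest by measurement rather than by guessing):

```cuda
#include <cstdio>

// Hypothetical memory-bound kernel to benchmark.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int blockSizes[] = {64, 128, 256, 512};
    for (int b = 0; b < 4; ++b) {
        int threads = blockSizes[b];
        int blocks  = (n + threads - 1) / threads;  // cover all n elements

        cudaEventRecord(start);
        myKernel<<<blocks, threads>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3d threads/block (%5d blocks): %.3f ms\n", threads, blocks, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```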

I do not think it is possible to have one thread with 8192 registers. I would guess that the most you could possibly do is one warp (32 threads) with 256 registers each, totaling 8192. However, I don’t recommend ever attempting such a kernel; the compiler would probably crash. Most kernels end up in the range of 5 to 25 registers per thread.

Yes. CTA is a block. CTA stands for “Cooperative Thread Array”

It is a good thought indeed. Probably the designers were expecting to run multiple active blocks on a multi-processor.

Yeah, true. But the hardware could also run them one by one on a single multiprocessor. We are NOT supposed to assume either behavior. But yes, we can assume that the hardware is smart enough.

Yes. But note that the shared memory gets split among the blocks. With 1 block on an MP, you have 16K of shared memory for that block. With 4 blocks on an MP, you have only 4K of shared memory per block. (I am assuming here that all 4 blocks are active on a single multiprocessor.)

A single block can use all resources of a multiprocessor as far as I know.
If you use 16K of shared memory, 512 threads and 16 registers per thread (or 256 threads with 32 registers each), you are using all the resources of a multiprocessor in a single block.
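The arithmetic behind that claim, spelled out (G80-era figures from earlier in the thread; in practice a few bytes of the shared memory are reserved for kernel arguments, so this is approximate):

```cuda
#include <cstdio>

// Worked numbers: one block that consumes a whole multiprocessor's
// register file (8192 registers) and shared memory (16 KB).
int main() {
    const int regsPerMP   = 8192;
    const int sharedPerMP = 16 * 1024;  // bytes

    printf("512 threads * 16 regs/thread = %d of %d registers\n",
           512 * 16, regsPerMP);
    printf("256 threads * 32 regs/thread = %d of %d registers\n",
           256 * 32, regsPerMP);
    printf("one block declaring %d bytes of shared memory uses it all\n",
           sharedPerMP);
    return 0;
}
```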

No. A multiprocessor has 24 warps, equaling 768 threads. A block with 512 threads can only achieve 16/24 occupancy, which is 66%. The topic author was right, and it is really an interesting thought.

No, a multiprocessor has 8 ALUs, 8K registers and 16K of shared memory. You will be using all of those resources.
Occupancy is not a resource, and it also does not always matter. Using a lot of resources in your kernel generally makes your occupancy go down.

If you are not counting the 24 warps as a resource, your theory is fine…

Anyway, the original thought from the topic owner on 768 threads is still interesting…

A warp is just the way CUDA splits up your threads; resources are things you can eat ;) But to come back to the 768 threads: I think that for a lot of parallel algorithms the number of threads per block needs to be a power of 2 (scan, reduction, etc.), so maybe it was too costly to go to 1024 active threads per MP?

It is actually really annoying, since 1024 threads per MP (and per block) would have given me a lot of benefit in most of my kernels. Personally, I try to design for 256 threads per block, and only when I use too much shared memory or too many registers to fit 3 blocks per MP do I go to 512 threads.

I am a real novice in this (data-parallel computing) field, but it at least seems plausible to me that I may have 768 threads in a single block with no global memory access during processing (using only registers and shared memory). That is why I raised this question.