Registers and Shared Memory question

Hi, I am pretty much a newbie with CUDA development. I have been reading the posts here fairly regularly and have found them very instructive; they have answered some of my questions. I have written several basic kernels at this point, just to get some familiarity and to reach the point of at least being able to ask the right questions. There are a couple of things I don’t fully understand about registers and shared memory that I am hoping someone can help me with.

Registers:

The CUDA programming manual states that:

This says nothing about the number of blocks to be run on a given multiprocessor. When kernels are time-sliced, what happens if the total number of registers across the blocks is greater than the number of registers on the multiprocessor? Are they temporarily transferred to global memory? Offhand, that would not seem to make sense, since time slicing is to some degree governed by memory reads/writes; it would seem to compound the I/O-bound issue (the relatively long global memory access delay) that caused the block to be switched out in the first place.

Shared memory:

The shared memory question is very similar to the register question. If the full set of blocks requires more shared memory than a multiprocessor has, what happens when one block is switched out for the next? I would assume its shared memory would be transferred to global memory, but again, this would seem to compound the I/O-bound issue that caused the block to be switched out.

Any information would be helpful. I have not found answers to these questions in the programming guide. Thanks in advance for any help.

The answer to both questions is: No, a block in progress will never be “swapped out” to make room for another one during execution. Multiple blocks are interleaved only if there are enough chip resources to load all of them on the multiprocessor simultaneously. So, if your kernel uses more than half of the registers or the shared memory on a multiprocessor, then only one block will run at a time.
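
To make the resource accounting concrete, here is a minimal sketch (my own illustration, not from the programming guide) that asks the runtime how many copies of a block can be resident on one multiprocessor at once. It assumes a modern CUDA toolkit: the occupancy API used here (cudaOccupancyMaxActiveBlocksPerMultiprocessor) postdates the GeForce 8 era, and the kernel and its names are hypothetical. Compiling with `nvcc --ptxas-options=-v` will also print the per-kernel register and shared memory usage that feeds into this calculation.

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that uses some registers and static shared memory.
__global__ void myKernel(const float* in, float* out)
{
    __shared__ float tile[256];                  // static shared memory per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[threadIdx.x] * 2.0f;
}

int main()
{
    int blockSize = 256;        // must not exceed the tile[] size above
    int numBlocksPerSM = 0;

    // How many copies of this block fit on one multiprocessor at once, given
    // its register and shared-memory usage? If the answer is 1, only one
    // block runs per multiprocessor at a time, as described above.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSM, myKernel,
                                                  blockSize, 0 /* dynamic smem */);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Resident blocks per multiprocessor: %d (%d multiprocessors total)\n",
           numBlocksPerSM, prop.multiProcessorCount);
    return 0;
}
```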

In terms of performance, though, it is better to first reduce the number of global memory accesses, and then to ensure the remaining accesses follow an optimal pattern. Only then worry about reducing register and shared memory usage, as long as that does not conflict with the first two goals.

seibert,

Thank you for your reply. I guess that should have been obvious to me, but now I know the answer instead of just guessing.

On another topic, it would seem that if you know a certain kernel will be I/O bound (by global memory accesses) because the computations are extremely short and simple, you would want the maximum number of threads allowed per block, in order to coalesce the maximum number of global memory reads/writes per block. In that case, if fewer than the maximum number of threads is used per block, it would seem there would be more “wait time” because of time slicing smaller blocks (and therefore smaller global memory accesses). Would you agree with that?

Given that the memory bus on the GeForce 8 series is up to 384 bits wide, even just a warp of 32 threads accessing contiguous floats will get full memory coalescing. I haven’t benchmarked things carefully, but you probably want the number of threads to be divisible by both 32 and 12 (a 384-bit bus divided by 32-bit floats), which means at least 96. Someone who has more optimization experience should comment on this, though. (Section 5.1.2.5 of the programmer’s guide also mentions that delays due to register dependencies are minimized if the number of threads is a multiple of 64, so maybe you’d even want to go up to 192 threads per block to make everything happy.)
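
For illustration, here is a minimal sketch of what a 192-thread-per-block launch might look like (the kernel and variable names are just made up for the example), with the grid size rounded up so all n elements are covered:

```
#include <cuda_runtime.h>

// Hypothetical kernel: scale an array of n floats by 2.
__global__ void scaleKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                          // guard the last, partially filled block
        out[i] = 2.0f * in[i];
}

void launchScale(const float* d_in, float* d_out, int n)
{
    const int threadsPerBlock = 192;    // multiple of 32 (warp size) and of 64
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    scaleKernel<<<numBlocks, threadsPerBlock>>>(d_in, d_out, n);
}
```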

I don’t think I understand the wait time/time slicing comment. There may be some overhead in starting a new block, but I’ve never heard anyone discuss it as a significant part of the total compute time. NVIDIA’s docs encourage the use of many blocks to ensure your code continues to scale across future devices, which will have more multiprocessors.

Thanks again for the reply. I will try to clarify the wait time/time slicing comment. I am probably not wording it right, since I am just getting to the point of knowing enough to be able to ask questions! To try to make an example, suppose a given block only has 32 threads and accesses floats. The memory access for that block/warp of threads will coalesce. Next, suppose that during execution of the kernel another block is swapped into the multiprocessor, because the first block is waiting on a global memory access. The second block will then make another coalesced access to global memory. So, since the blocks are I/O bound (they are time-sliced because of waiting for global memory), it would seem the blocks would want to have as many threads as possible in order to coalesce as many global memory accesses at one time (in a given block) as possible. I guess my question stems from my thought that one large coalesced global memory access will always be faster than several smaller coalesced accesses. I hope that makes some sense; if not, maybe I am having a problem understanding how coalesced memory accesses work between blocks.

I see; this isn’t a big deal, because you don’t have to read that many floats at once to get a coalesced read. The point of coalescing reads is to ensure that you use all 384 bits of the memory bus (or however wide it is on your particular card) on every transfer. So as long as your warp is requesting 12 contiguous floats (or some other data of the same size), the read can be coalesced. That’s not very many words when you have 192 threads running in a block, so one block will probably need to make many coalesced reads, one after the other. The same goes for other blocks running on the multiprocessor. There’s not much danger of fragmenting your reads unless you only have 8 threads running per block for some reason.
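
As a sketch of the difference (hypothetical kernels, just to illustrate the pattern), the first copy below lets consecutive threads read consecutive floats, which the hardware can combine into wide transactions, while the strided version cannot be coalesced:

```
#include <cuda_runtime.h>

// Coalesced: thread k of a warp reads element k of a contiguous run, so the
// warp's 32 requests combine into a few wide memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads touch addresses far apart, so their requests
// cannot be combined and each load becomes a separate, narrower transaction.
__global__ void copyStrided(const float* in, float* out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```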

If there are multiple blocks running, then you do get some benefit, since while the first block is waiting for the read to finish, the second block can be using the otherwise idle ALUs. Of course, if the second block happens to be at a read stage as well, then it will have to wait in line for memory access, like you would expect.

It is generally a good idea to make the number of threads and blocks in your program an easily changeable number, so that you can benchmark different configurations. The interactions between the parts on the GPU are sometimes hard to model in your head, and a little empirical testing can go a long way. :)
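
Here is a rough sketch of what that kind of benchmark loop could look like, using CUDA events for timing (the kernel is the same hypothetical scaleKernel as in the earlier sketch, and the candidate block sizes are arbitrary):

```
#include <cstdio>
#include <cuda_runtime.h>

// Same hypothetical kernel as in the earlier sketch.
__global__ void scaleKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

void sweepBlockSizes(const float* d_in, float* d_out, int n)
{
    const int candidates[] = {32, 64, 96, 128, 192, 256, 384, 512};

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int threads : candidates) {
        int blocks = (n + threads - 1) / threads;   // round up to cover n

        cudaEventRecord(start);
        scaleKernel<<<blocks, threads>>>(d_in, d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);      // wait for the timed launch to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3d threads/block: %.3f ms\n", threads, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```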

Thanks again seibert. That all makes sense. I have taken your suggestion and made the number of threads and number of blocks easy to change in one of my kernels, which makes it very easy to benchmark different configurations. I will make it a habit in the future to calculate any other thread- or block-related variables from these constants. Again, thank you for your time.

Performance for different block sizes can be nigh impossible to predict. Of course, there is the memory coalescing issue: just follow the rules in the programming guide on that (all threads in a half-warp access contiguous memory locations with the proper start address). But then there are issues of multiprocessor occupancy, shared memory usage, register read-after-write dependencies, texture memory access patterns, etc. As far as I am concerned, the only way to find the best block size is to try them all (assuming your algorithm can adapt to different block sizes easily). The one you benchmark as the fastest is the fastest.

In my testing, I have found that the fastest executing block size even depends on the problem size (which changes memory access patterns and the number of blocks run).