Hi, I am pretty much a newbie with CUDA development. I have been reading the posts here fairly regularly and have found them very instructive; they have answered some of my questions. I have written several basic kernels at this point, just to get some familiarity and to be able to start asking questions. There are a couple of things I don't fully understand about registers and shared memory that I am hoping someone can help me with.
The CUDA programming manual states that
This says nothing about how many blocks can run on a given multiprocessor. What happens when blocks are time-sliced and the total number of registers they need exceeds the registers available on that multiprocessor? Are the registers temporarily transferred to global memory? Offhand, that would not seem to make sense: time slicing is to some degree governed by memory reads and writes, so spilling registers would compound the very problem (the relatively long global memory access delay) that caused the block to be switched out in the first place.
The shared memory question is very similar to the register question. If the blocks together require more shared memory than is available on the multiprocessor, what happens when one block is switched out for the next? I would assume its shared memory is transferred to global memory. But again, this would seem to compound the memory-latency problem that caused the block to be switched out.
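To make both questions concrete, here is a rough sketch of the arithmetic I have in mind. It is plain host-side C++, not a CUDA API call. It assumes a block can be resident on a multiprocessor only if all of its registers and shared memory fit at once, so nothing would ever need to be moved to global memory. The struct `SmBudget`, the function `residentBlocks`, and every number passed to them are hypothetical names and values chosen for illustration, not taken from the manual.

```cpp
#include <algorithm>

// Hypothetical per-multiprocessor resource budget (illustrative, not from the manual).
struct SmBudget {
    int registers;       // total registers on one multiprocessor
    int sharedMemBytes;  // total shared memory on one multiprocessor, in bytes
    int maxBlocks;       // assumed hardware cap on resident blocks
};

// Number of blocks that could be resident on one multiprocessor at the same time,
// ASSUMING a block is resident only when all of its registers and shared memory fit.
// A result of 0 means a single block already exceeds the budget.
int residentBlocks(const SmBudget& sm, int regsPerThread,
                   int threadsPerBlock, int sharedBytesPerBlock) {
    const int regsPerBlock = regsPerThread * threadsPerBlock;
    // A resource that is not used at all does not limit the block count.
    const int byRegs = regsPerBlock > 0 ? sm.registers / regsPerBlock : sm.maxBlocks;
    const int byShared = sharedBytesPerBlock > 0
                             ? sm.sharedMemBytes / sharedBytesPerBlock
                             : sm.maxBlocks;
    return std::min({byRegs, byShared, sm.maxBlocks});
}
```

For example, with an assumed budget of `SmBudget{8192, 16384, 8}`, calling `residentBlocks(sm, 16, 128, 4096)` gives 4, since both the registers (8192 / 2048) and the shared memory (16384 / 4096) allow four blocks. If that assumption holds, the question becomes whether the hardware simply limits the number of resident blocks this way instead of swapping anything out.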
Any information would be helpful. I have not found answers to these questions in the programming guide. Thanks in advance for any help.