Hi everybody. This is my first post here – hope it’s a good one. (I’ve just finished reading the CUDA manual cover to cover, twice…)
I understand that the scheduler needs some latitude in deciding which blocks of a grid, and which warps of a block, to run at a given time. Things stall while waiting for memory, etc., so the scheduler needs to be able to get something else going in the meantime. Sounds good to me.
However, it seems to me that in certain applications it would be very helpful to know something about how the scheduler chooses the next block/warp to run. The main reason is memory coherence: if the warps that run back-to-back touch data that sits close together in memory, the caches stay warm.
For example, my first CUDA application is a form of convolution. In convolution, two adjacent outputs are computed from ranges of the input that almost completely overlap. So if one of my warps stalls, the warp most likely to find that data still in the cache is the very next warp in the block, and it would be highly advantageous if that were the one scheduled next. Similarly, if a multiprocessor is going to process more than one block at a time, it would help if those blocks were close together in the grid. Does this make sense?
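To make that concrete, here's a stripped-down sketch of the kind of kernel I have in mind (the names are made up, and I'm assuming the input array has been padded with k-1 extra elements so the loop never reads out of bounds):

__global__ void conv1d_naive(const float *in, const float *weights,
                             float *out, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Output i reads inputs [i, i + k); output i+1 reads [i+1, i+1+k).
    // Adjacent threads -- and adjacent warps -- therefore touch input
    // ranges that overlap in all but a handful of elements.
    float acc = 0.0f;
    for (int j = 0; j < k; ++j)
        acc += in[i + j] * weights[j];
    out[i] = acc;
}

With one output per thread, warp w covers outputs [32w, 32w + 32), so its input window is offset from warp w+1's by only 32 elements. If warp w+1 runs right after warp w stalls, most of its reads should still be warm; if some distant warp runs instead, they won't be.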
For all I know, the scheduler is totally random. Is there any documentation that shines a little light on the scheduler’s “undefined” behavior?