"undefined" order of blocks / warps?

Hi everybody. This is my first post here – hope it’s a good one. (I’ve just finished reading the CUDA manual cover to cover, twice…)

I understand that the scheduler needs some latitude in deciding which blocks of a grid, and which warps of a block, to run at a given time. Things stall while waiting for memory, etc., so the scheduler needs to be able to get something else going in the meantime. Sounds good to me.

However, it seems to me that in certain applications, it would be very helpful to know something about how the scheduler chooses the next block/warp to run. The main reason for this would be to help ensure coherent access to memory.

For example, my first CUDA application is a form of convolution. In convolution, two adjacent outputs are computed from ranges of the input that almost completely overlap. So if one of my warps stalls, in order to maximize the probability of cache hits, it would be highly advantageous if the very next warp in the block were the one most likely to run next. Similarly, if a multiprocessor is going to process more than one block at a time, it would help if those blocks were close together in the grid. Does this make sense?

For all I know, the scheduler is totally random. Is there any documentation that shines a little light on the scheduler’s “undefined” behavior?


I worried about exactly this same issue before I implemented part of my application. In the simplest form, every output point samples ~80 elements from a ~5 MB dataset completely at random. The 2D texture cache works wonders, and I sustain a 21 GB/s transfer rate. If I perform a sort on my data so that nearby points in the output access nearby points in the dataset, the memory transfer rate shoots up to 61 GB/s. Not too bad for semi-random data access, although I may be compute-limited now (135 GFLOPS with very few MADDs). I observe only a slight performance edge when running one block per multiprocessor compared to running multiple blocks per multiprocessor. But that may be due to a higher warp occupancy, and not due to the 2nd block thrashing the cache.

With an undefined order of block computations, I think the best thing that can be done is to ensure that the different threads in a warp access nearby elements in the 2D texture cache. Given that my performance numbers are pushing close to the device limits, I don’t worry about how the block ordering affects the texture cache anymore.

My advice is to just try out some test code and see how closely you push the device limits of memory transfer and FLOPS; you may be as surprised as I was at the performance you can achieve. Although I do agree it would be interesting to know whether the device attempts to place nearby blocks on the same multiprocessor.

Thanks, MisterAnderson. That’s extremely helpful.