"undefined" order of blocks / warps?

Hi everybody. This is my first post here – hope it’s a good one. (I’ve just finished reading the CUDA manual cover to cover, twice…)

I understand that the scheduler needs some latitude in deciding which blocks of a grid, and which warps of a block, to run at a given time. Warps stall while waiting for memory and the like, so the scheduler needs to be able to get something else going in the meantime. Sounds good to me.

However, it seems to me that in certain applications, it would be very helpful to know something about how the scheduler chooses the next block/warp to run. The main reason for this would be to help ensure coherent access to memory.

For example, my first CUDA application is a form of convolution. In convolution, two adjacent outputs are computed from ranges of the input that almost completely overlap. So if one of my warps stalls, in order to maximize the probability of cache hits, it would be highly advantageous if the very next warp in the block were the one most likely to run next. Similarly, if a multiprocessor is going to process more than one block at a time, it would help if those blocks were close together in the grid. Does this make sense?
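To make the overlap concrete, here is a rough sketch of the kind of kernel I have in mind (the kernel radius, names, and boundary handling are made up for illustration, not my actual code):

    // 1D convolution sketch: adjacent outputs read input windows that
    // overlap in all but one element, so consecutive threads/warps touch
    // mostly the same input data.
    #define KERNEL_RADIUS 8

    __global__ void convolve1d(const float *in, float *out,
                               const float *weights, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float sum = 0.0f;
        for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k) {
            int j = i + k;
            if (j >= 0 && j < n)
                sum += weights[k + KERNEL_RADIUS] * in[j];
        }
        out[i] = sum;
    }

With that layout, warp n and warp n+1 read input windows offset by only 32 elements, so if they run back to back, most of the second warp's reads should already be sitting in the cache.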

For all I know, the scheduler is totally random. Is there any documentation that shines a little light on the scheduler’s “undefined” behavior?

Jim

I worried about exactly this same issue before I implemented part of my application. In the simplest form, every output point samples ~80 elements from a ~5 MB dataset completely at random. The 2D texture cache works wonders, and I sustain a 21 GB/s transfer rate. If I sort my data so that nearby points in the output access nearby points in the dataset, the memory transfer rate shoots up to 61 GB/s. Not too bad for semi-random data access, although I may be compute limited now (135 GFLOPS with very few MADDs). I see only a slight performance edge when running one block per multiprocessor rather than several, and that difference may be down to warp occupancy rather than the second block thrashing the cache.
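For what it's worth, the pre-sort itself is nothing fancy. Here is a host-side sketch of the idea; the sample-coordinate key and the row bucketing are just assumptions for illustration, and you would substitute whatever spatial key fits your data:

    // Order the output points by where they sample the dataset, so that
    // neighbouring outputs (and hence the threads of a warp) touch
    // neighbouring data. The sort key here is purely illustrative.
    #include <algorithm>
    #include <vector>

    struct OutputPoint {
        float sampleX, sampleY;  // where this output reads the dataset
        int   originalIndex;     // so results can be scattered back later
    };

    void sortBySampleLocality(std::vector<OutputPoint> &pts, float rowHeight)
    {
        std::sort(pts.begin(), pts.end(),
                  [rowHeight](const OutputPoint &a, const OutputPoint &b) {
                      // Coarse row-major ordering: bucket by Y, then order
                      // by X, so consecutive points land near each other.
                      int rowA = (int)(a.sampleY / rowHeight);
                      int rowB = (int)(b.sampleY / rowHeight);
                      if (rowA != rowB) return rowA < rowB;
                      return a.sampleX < b.sampleX;
                  });
    }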

With an undefined order of block computations, I think the best thing that can be done is to ensure that the different threads in a warp access nearby elements in the 2D texture cache. Given that my performance numbers are pushing close to the device limits, I don’t worry about how the block ordering affects the texture cache anymore.
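Concretely, the access pattern I aim for looks roughly like this; it is a simplified sketch using the texture-reference API, with the host-side binding to a cudaArray omitted, not my actual kernel:

    // Each thread in a warp maps to a consecutive x texel, so a warp's 32
    // fetches (and their small neighbourhoods) stay within a compact 2D
    // footprint that the texture cache handles well.
    texture<float, 2, cudaReadModeElementType> dataTex;

    __global__ void sampleNearby(float *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float sum = 0.0f;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                sum += tex2D(dataTex, x + dx + 0.5f, y + dy + 0.5f);

        out[y * width + x] = sum;
    }

However the blocks end up being scheduled, any warp that runs keeps its reads clustered, which is the part you can actually control.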

My advice is to just write some test code and see how closely you can push the device limits on memory bandwidth and FLOPS; you may be as surprised as I was at the performance you can achieve. That said, I do agree it would be interesting to know whether the device tries to place nearby blocks on the same multiprocessor.

Thanks, MisterAnderson. That’s extremely helpful.