How Are Idle Warps Scheduled?

If I have a problem that only needs, say, 32 threads to solve, but I launch 128, what does the scheduler do with the other 3 warps? Does it know that they have no work to do, so that they are never scheduled? I’m asking because I have a fairly complex problem that needs a specific number of warps to attack problems of different sizes. I’m still launching fewer warps in total than the card can handle at one time, so I’m wondering whether there is any performance degradation compared to launching fewer warps that are all doing active work.

The only way you can indicate that a warp is “idle” according to your definition is via some code construct, like:

if (threadIdx.x < 32) {
    // work for the first warp only
}

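At a minimum, the if statement must be executed by every thread of every warp, so all of the launched warps will be scheduled at least once. A warp that fails the test and has nothing else to execute simply runs to the end of the kernel and exits. Broadly speaking, the overhead for this should be small.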
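To make the point concrete, here is a minimal sketch of such a guard, assuming a block of 128 threads of which only the first 32 have work. The names `first_warp_only` and `idle_warps` are hypothetical and only for illustration; the body of the guarded branch is a placeholder.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;  // warp size on current NVIDIA GPUs

// Hypothetical host-side helper: how many launched warps have no work at all.
constexpr int idle_warps(int launched_threads, int needed_threads) {
    return (launched_threads + kWarpSize - 1) / kWarpSize
         - (needed_threads + kWarpSize - 1) / kWarpSize;
}

// Every warp is scheduled and evaluates the guard once. Warps whose threads
// all fail it return immediately and issue no further instructions.
// Note: an early return like this is only safe if the rest of the kernel
// does not call __syncthreads().
__global__ void first_warp_only(float* out) {
    if (static_cast<int>(threadIdx.x) >= kWarpSize) return;
    out[threadIdx.x] = 1.0f;  // placeholder for the real work
}
```

Launched as `first_warp_only<<<1, 128>>>(d_out)`, this gives one working warp and `idle_warps(128, 32)` warps that only pay for the comparison and the return.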

Thanks, I suspected exactly what you said. It will be similar to the code above. Let me elaborate on the problem so it’s clear why I’m doing this. I have about 500 arrays of data that I need to apply some function to, and each array is likely a different size. The “optimal” way I thought of was to break the 500 array sizes down into 8 bins, while trying to keep the spread of sizes within each bin (the distance from the bin’s maximum) about the same across bins. Since an array can hold thousands of elements, I want to process each one optimally by launching as many threads as possible. To do this, I use the maximum array size in each bin as my “threads_per_array” variable, which can be something like 4000. Since there may be much smaller arrays in that bin (say, 2000 elements), there will be many idle warps for those. Each bin is launched with a different number of blocks and a different block size, in a separate Hyper-Q stream.
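The scheme described above could be sketched roughly as follows. This is an illustration under assumptions, not the actual code: the names `process_bin`, `launch_bin`, `blocks_per_array`, `offsets` and `lengths` are hypothetical, the arrays of one bin are assumed to be packed back to back in one buffer, and the doubling stands in for the real function.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Blocks needed to cover the largest array in a bin (threads_per_array).
inline int blocks_per_array(int threads_per_array, int block_size) {
    assert(block_size > 0);
    return (threads_per_array + block_size - 1) / block_size;
}

// blockIdx.y selects an array within the bin; blockIdx.x and threadIdx.x
// select an element of that array. offsets[a] is where array a starts in
// the packed buffer, lengths[a] is its element count.
__global__ void process_bin(const float* data, const int* offsets,
                            const int* lengths, float* out) {
    int a = blockIdx.y;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= lengths[a]) return;  // threads past the end of a short array exit here
    int g = offsets[a] + i;
    out[g] = 2.0f * data[g];      // placeholder for the real function
}

// Launch one bin in its own stream; all pointers are device pointers.
inline cudaError_t launch_bin(const float* d_data, const int* d_offsets,
                              const int* d_lengths, float* d_out,
                              int arrays_in_bin, int threads_per_array,
                              int block_size, cudaStream_t stream) {
    dim3 grid(blocks_per_array(threads_per_array, block_size), arrays_in_bin);
    process_bin<<<grid, block_size, 0, stream>>>(d_data, d_offsets, d_lengths, d_out);
    return cudaGetLastError();
}
```

With `threads_per_array = 4000` and a block size of 256, each array gets 16 blocks. For a 2000-element array in that bin, the first 8 blocks do work and the remaining 8 consist entirely of warps that evaluate the guard and exit.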

The only other way I figured I could do this is by mapping each thread ID to an array ID, but this could be quite a large structure that has to be copied to the device each time my kernel starts, which I figured might be more expensive than the approach above. I hope that makes sense.
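For comparison, one possible way to keep such a mapping small, which is not what the post describes, is to store one entry per array rather than one per thread: a prefix sum of the blocks each array needs, searched by block index on the device. The sketch below assumes all arrays are packed into one buffer; `build_block_starts`, `find_array`, `process_all`, `block_start` and `elem_start` are hypothetical names.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Host: block_start[i] is the first block assigned to array i;
// block_start[n] is the total number of blocks to launch.
inline std::vector<int> build_block_starts(const std::vector<int>& lengths,
                                           int block_size) {
    std::vector<int> starts(lengths.size() + 1, 0);
    for (size_t i = 0; i < lengths.size(); ++i)
        starts[i + 1] = starts[i] + (lengths[i] + block_size - 1) / block_size;
    return starts;
}

// Binary search for the largest i with block_start[i] <= block.
__host__ __device__ inline int find_array(const int* block_start,
                                          int num_arrays, int block) {
    int lo = 0, hi = num_arrays;
    while (hi - lo > 1) {
        int mid = (lo + hi) / 2;
        if (block_start[mid] <= block) lo = mid; else hi = mid;
    }
    return lo;
}

// elem_start is the prefix sum of array lengths (num_arrays + 1 entries).
__global__ void process_all(const float* data, const int* elem_start,
                            const int* block_start, int num_arrays, float* out) {
    int a = find_array(block_start, num_arrays, blockIdx.x);
    int i = (blockIdx.x - block_start[a]) * blockDim.x + threadIdx.x;
    if (i >= elem_start[a + 1] - elem_start[a]) return;  // tail of last block only
    int g = elem_start[a] + i;
    out[g] = 2.0f * data[g];  // placeholder for the real function
}
```

For 500 arrays this is two tables of 501 integers, and only the last block of each array contains idle threads. Whether the extra search per block beats the binned launches would have to be measured.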