questions about thread execution & volatile

I have two questions:
1. Suppose two thread blocks are scheduled onto the same SM, and each block contains 2 warps. If warp 1 from block 1 has just executed instruction A, the SM now has 4 choices to execute: warp 1 from block 1, warp 2 from block 1, warp 1 from block 2, and warp 2 from block 2. Which one will the SM choose? I guess warp 2 from block 1 will be chosen, so that threads from the same block will have the same progress. Am I right? What's the policy for choosing warps?

2. What exactly do "ld.volatile" and "st.volatile" mean? The manual says they are used to "enforce sequential consistency between threads accessing shared/global memory". Can anyone give an example showing that the non-volatile and volatile versions of ld/st may get different results?

Don't build your algorithms around the order in which the threads execute, since it is not known.

As for your second question, the volatile keyword still eludes me.

Where does your quote come from?

Another question: is the execution of a warp interruptible? For example, if the first quarter of a warp has finished an instruction, is it guaranteed that the second, then the third, then the last quarter of the same warp is executed?

Yes, the warp is the smallest unit of execution.

The quote is from the programming guide (version 2)

And also about 3 dozen times from NVIDIA guys on the forums pointing it out to those who ignore the programming guide.

I did not read the programming guide carefully. This sentence is on page 7. :">

I bet you could do experiments to find out what’s the scheduler policy here, but it is recommended not to rely on it, mostly because NVIDIA might change it without notice. You should stick to the programming model if you’re writing a CUDA application.

Just for the sake of speculation, my guess is that the scheduler would do a simple round-robin selection amongst the warps available for execution (i.e. the ones with no stalls). In case that is true, I expect an execution order like the following:

block 1 warp 1

block 1 warp 2

block 2 warp 1

block 2 warp 2

Note that you will have many instruction stalls because the total number of warps is less than 6. I believe someone from NVIDIA gave this number in the forums, but the forum search is obviously broken so I can’t find it now.

CUDA doesn’t say very much at all about the scheduler design, so we have to do detective work and see what fits the facts we DO know.

One reasonable scheduler design is that the order of warps is DYNAMIC… it’s a circular queue. Some warps get removed from the list because they’re stalled (for memory access usually). But when the memory is finally delivered, the warp gets put back into the active queue.
The alternative design would be for the scheduling order to be FIXED, like Anjul’s list in the post above, and the scheduler SKIPS inactive warps. That’s more work for the scheduler, since if one warp has to be skipped, the scheduler has to check if the NEXT warp is OK, and if THAT has to be skipped, it has to go to the NEXT warp, etc. And the scheduler has to do this FAST… effectively in less than one clock. So the variable “skip” length makes that zero-cost scheduling much harder to implement.
A dynamic active warp list makes it easier for the scheduler (time wise) since the only decision that needs to be made is whether to remove the warp from the active list or not.

Now what other clues do we have to confirm the design?
The scheduler can only handle so many warps… 32 in G200. That implies this maximum is the size of that circular buffer. If the order were fixed, there would likely be no scheduler warp limit. (You'd still be limited in practice by registers, of course.)

Another big clue: “syncthreads()” is cheap. If the scheduler uses a circular list, syncthreads basically would just pop out any warp that hits the barrier from the active list, then when the active list is empty (including pending memory warps), the active warp list gets repopulated and starts up again.
That's cheap. With a fixed order list, syncthreads involves marking each warp with status flags, checking for them all to be set (skipping over the waiting ones every time), and then clearing those bits after all the warps finish their flagging. Again, not expensive, but also not "only 4 clocks used" like the manual says.

So, supported by these clues, I think the GPU’s scheduler uses a variable length, dynamic, circular list, and therefore the answer to “what order are the warps scheduled?” is likely “effectively randomly, in a circular, but dynamic, order.”

Other interesting clue, just for brainstorming:
Half-warps of 16 threads seem to be a natural fit for the scheduling size: 16 is the memory transaction size, and the manual even mentions obliquely that threads are grouped into bundles of 16. So it's quite possible that the hardware is capable of scheduling quanta of HALF warps of 16 threads! So why make warps 32? Because of that scheduling queue size… if you had a quantum size of 16 threads per warp, you'd need double the scheduling slots and therefore more scheduling hardware. It's likely a design decision whether 16 or 32 (or 8 or 64) is most efficient for actual computation. It seems logical that the smaller the better (hey, make every thread independent, right?), but that's actually wrong: you'd have to double, or quadruple, the scheduling hardware. And remember how well that hardware works… there's NO scheduling overhead, syncthreads is cheap, it's all automatic. It may be expensive (transistor-wise) to scale that up. Possible of course, but is it worth the tradeoff?

Wow, that is a good investigation! Thank you for your thoughts.

I should have been more specific. I was looking at a circular queue myself, but it makes more sense to me that each slot in this queue is pre-assigned to a warp at kernel launch. Such a fixed scheduling order seems more plausible from a hardware implementation standpoint. Given a small maximum warp count of 24 (G80) or 32 (G200), it should be simple to implement a combinational find-first-valid circuit that works in a single cycle. The scheduler doesn't need to do a chain of skips across the queue; it just fetches the next valid index from this circuit.

Moreover, if warps are allotted specific slots in this fixed queue (each of which may or may not be valid), it might be easy to implement a fast __syncthreads(). You know exactly which slots belong to the block in question; in fact, they will most likely all be in sequence. Hence, the naive way to implement this barrier would be to keep a barrier bit with each warp slot (this also fits well with the "barrier resource" concept often seen in PTX listings - they may have more than a single bit there). This bit is set whenever a warp hits a barrier, and the warp goes invalid. The job of the scheduler, now, is to compare this bit-vector to 2^n - 1 (i.e. all 1s), where n is the number of warps in the block. Whenever this comparison succeeds, the scheduler just re-enables all threads of the block and sets the vector to 0. I think that can be done in 4 cycles.

Can you please elaborate on your explanation of syncthreads in a dynamic circular queue for the case when there are multiple blocks on the same SM, such that not all of them might be heading for a barrier? Would the active list be guaranteed to be empty when all warps in a block hit the barrier? I can't seem to see that, unless there are multiple queues per SM. Well, with enough spare bits, I think your model could fit as well. We'll never know who's right unless someone from NVIDIA hints or I find time for some experiments. Maybe soon.


Tim Murray has said on the forum that the current hardware could have a warp size of 16 (the instruction decoder runs at half the speed of the ALUs), but that 32 was chosen because future hardware might require a warp size of 32 (I guess it would have 16 ALUs per MP). So it could very likely be hardware related (you make a lot of sense above), but the only thing I heard from NVIDIA about it was compatibility with future hardware; I guess the decision was made before the warp size became a builtin variable.

You’re thinking of this like a software programmer. In fact in hardware, it’s not difficult to check a dozen warps at once and flag the first one that’s ready. (I can already imagine a chain of transistors that can do this.) Implementing a circular queue is much more difficult (it’s very much an algorithm that works well as software. For one, it requires memory.)

To frame it in a rather different (and probably better) way, the fixed version is a parallel reduction, while the dynamic version is serial and hard to parallelize. Hardware is immensely parallel.

Actually, SPWorley is spot-on. The bigger the scheduling granularity (even if bank conflicts, coalescing, etc. all still operate on groups of 16 threads), the less scheduling hardware you need and the slower it can run. (Btw, today, the granularity of branching and instruction issue is already 32 threads. It is bank conflicts and coalescing that may yet become 32 in the future.)

Also, keep in mind that groupings of 64 threads also play a role in current hardware. You want your blocks to be multiples of 64. (It might be related to register allocation, or it might be scheduler-related. I forget.)

It’s to avoid register memory bank conflicts.

You’re completely correct, I am thinking software wise. And yeah, checking N “ready flags” in parallel is just some extra wires or whatnot in hardware. Scheduling really could be implemented either way.

Well, we could try to actually test all this. Make some warps, have the kernel store clock() results in registers, artificially delay and mix up some warps with some double ops and global memory reads, syncthreads, etc. Then write out all the clock() results to global memory, grab them on the CPU, sort by clock(), and we should have a list of what order the warps were called in. This assumes clock() returns unique and ascending values, but I think that is true.
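A minimal sketch of that experiment could look like the kernel below. Everything here is illustrative (kernel and buffer names are made up), and keep in mind that clock() is a per-SM counter, so only stamps from warps on the same SM are directly comparable:

```cuda
// Sketch of the proposed experiment: each warp records the SM clock
// when it starts, then the host sorts the stamps to recover the order
// in which warps were issued.  Names here are illustrative only.
__global__ void warp_order(unsigned int *stamps)
{
    unsigned int t = (unsigned int)clock();  // per-SM cycle counter

    // burn some time so warps interleave: a dependent FP chain the
    // compiler cannot collapse
    float x = (float)threadIdx.x;
    for (int i = 0; i < 1000; ++i)
        x = x * 1.0000001f + 0.5f;

    // one stamp per warp, written by lane 0 of each warp
    if ((threadIdx.x & 31) == 0) {
        int warp = (blockIdx.x * blockDim.x + threadIdx.x) >> 5;
        stamps[warp] = t + (x == 0.0f);      // keep x live so the loop survives
    }
}
```

On the host you would then sort (stamp, warp-index) pairs per SM; an %smid query (or one block per SM) would be needed to group the stamps correctly before sorting.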

Yes it is true. I’ve already tested it.

Something interesting could be done when clock() is used to calculate a branch condition.

Or something like

Mem[(Clock() div ThisLineClockExecutionInClocks) - offset] = blockDim.x * blockIdx.x + threadIdx.x;

haha very cool. And your results??

It would be great to see your kernel code as well.

well, according to decuda, there is no such modifier in the hardware, so I guess it serves as a hint for the PTX assembler not to optimize out memory reads/writes,

for instance, I found it very handy for warp-sized reductions; consider the following code:

	volatile int *scan = (int *)(shared + HF + thid_in_warp +
		__umul24(thid >> 5, WS + HF + 1));
	unsigned cy = ..; // values for reduction
	scan[- HF] = -1; // identity element
	scan[0] = cy;
	cy = max((int)cy, (int)scan[-1]) & cy;
	scan[0] = cy;
	cy = max((int)cy, (int)scan[-2]) & cy;
	scan[0] = cy;
	cy = max((int)cy, (int)scan[-4]) & cy;
	scan[0] = cy;
	cy = max((int)cy, (int)scan[-8]) & cy;
	scan[0] = cy;
	cy = max((int)cy, (int)scan[-16]) & cy;
	scan[0] = cy;

unless 'scan' is declared volatile, the compiler can reorder/remove the writes to shared memory that the algorithm needs in order to work properly

Yes, I see it as a hint for nvopencc and ptxas not to reorder reads/writes and to always load from shared memory instead of using values kept in registers.
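To tie this back to the original question, here is a sketch of the classic failure case: a warp-synchronous reduction with no __syncthreads() between steps. This is a simplified stand-in for the scan code above, not the exact same algorithm; without the volatile qualifier the non-volatile version *may* (not must) produce a stale result, depending on what the compiler decides to keep in registers:

```cuda
// Warp-synchronous sum over 32 shared-memory values (tid = lane id).
// With a plain 'int *s' the compiler is free to cache s[tid] in a
// register across the steps, so a lane may never observe its
// neighbours' freshly written partial sums.  Declaring the pointer
// volatile forces every s[...] access to actually touch shared memory
// (ld.volatile/st.volatile in the generated PTX).
__device__ int warp_reduce(volatile int *s, int tid)
{
    // each step reads a neighbour's partial sum written in the
    // previous step; correctness relies on lockstep warp execution
    if (tid < 16) s[tid] += s[tid + 16];
    if (tid <  8) s[tid] += s[tid +  8];
    if (tid <  4) s[tid] += s[tid +  4];
    if (tid <  2) s[tid] += s[tid +  2];
    if (tid <  1) s[tid] += s[tid +  1];
    return s[0];
}
```

Dropping volatile from the parameter type is the "different results" scenario the first post asked about: the code is still legal, but the stores between steps may be deferred or eliminated, so other lanes read old values.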