CUDA Kernel self-suspension? Can a CUDA Kernel conditionally suspend its execution?

Hi everyone.

I’m new to this forum and this is my first post. I’m a C / x86 programmer of 20 years, and I’ve read as much of the NVidia CUDA docs as I could get my hands on, particularly the PTX ISA 2.1 documentation. It’s very exciting stuff.

However, having scoured the documentation, I still find myself with some fairly basic questions, most, if not all of which have to do with what the docs call “Parallel Synchronization and Communication”. Two questions which stand out at the moment are:

  1. Can the PTX “bar.red.op.pred” instruction specify a thread count of zero? I ask this because if it is legal to do this, would that then make it behave like the “bar.red” equivalent of the “bar.arrive” instruction?

  2. More generally, I wish to find a way for a Kernel to be able to suspend its own execution until a global memory variable has reached a certain value. Has anyone ever needed to do this, and/or how could this be done? Does anyone know if this is even possible?

In terms of C, I’m looking for a way to encode a “void Wait_For_GE( unsigned *Global, unsigned Val )” function that will suspend execution of the thread until *Global >= Val, as in the following C pseudo-code:

#define N 4096

uchar Ready[ N ];                         // <== global to all threads…

void Calc_Var( unsigned n, unsigned m )   // <== a callable function…
{
    Wait_For_GE( &Ready[ n ], m );        // <== would suspend execution until Ready[ n ] >= m…
    Do_Some_Stuff( n, m );
    ++Ready[ n ];                         // <== this could also include extra sync code to support Wait_For_GE()…
}

void Kernel_Func( ushort ThreadID )
{
    ushort n = 0, m = ThreadID;

    do
    {
        Calc_Something( n, m );
        Calc_Var( n++, m++ );
    }
    while ( m < N );
}

I realize the above is not in the correct CUDA C/C++ format - it’s more just a conceptual way of posing the question. Obviously, the “Kernel_Func” would be implemented as a 1D CUDA Kernel thread which, in this case, would (hopefully) have N (4096) instantiations.

My current thinking is this: assuming a local predicate variable “P” in function Calc_Var(), a “bar.red.or.pred P, 0, 0, 1” instruction could be inserted immediately after the “++Ready[ n ]” (assuming that’s even possible - see question 1), which would signify an “arrival” at “barrier” 0. The Wait_For_GE() function could then loop on a “bar.red.or.pred P, 0, WARP_SZ, 0” instruction, breaking out of the loop only when P is true (i.e. ++Ready[ n ] has been executed) AND Ready[ n ] >= m.

Originally, my idea was to have the Wait_For_GE() function loop on the “bar.red.or.pred” instruction with a thread count of one. But the PTX docs specifically say:

“Since barriers are executed on a per-warp basis, the optional thread count must be a multiple of the warp size.”

So the thread count must be WARP_SZ.

Basically though, the chances of the above idea actually working are slim, so I’m looking for a better way to do this.

Any ideas anyone?

I’ve never seen a way to suspend execution while waiting on a change in global memory. At best, you could busy-wait on a condition variable. However, if the goal is to start GPU processing in response to a condition on the host, that seems overly complicated. The preferred approach in CUDA is to wait and launch your kernel from the host at the moment you need to start processing. Are you worried about the latency or launch overhead?
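
For illustration only, a busy-wait amounts to something like the following sketch (the function name, the flag, and the “greater-or-equal” condition are just placeholders):

// Rough sketch of a busy-wait on a global flag - illustrative only.
// 'volatile' forces re-reads from global memory instead of letting the
// compiler cache the value in a register.
__device__ void wait_for_ge( volatile unsigned *flag, unsigned target )
{
    while ( *flag < target )
        ;   // spin: burns an SM, and can hang if the writer never gets scheduled
}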

Thanks for the reply, seibert.

Actually though, I’m seeking to “start” (actually “continue”) GPU processing in response to a condition on the GPU, not the host. The GPU will be the one that both sets and tests the “Ready[ n ]” variable. In this scenario, the only thing the host will do is to start the 4,096 Kernel threads.

The thing is, each thread will loop anywhere from one to 4,096 times, each time executing Calc_Something() (which is substantial), followed by a call to Do_Some_Stuff() - but Do_Some_Stuff() can be executed only after it has already been called a certain number of times by other (concurrent) threads.

I realize that the “proper” way to do this is to “unroll” it all, then somehow rewrite the (potentially huge) result so that the host makes all the decisions about when each ‘part’ can start, using calls to CUDA functions like cudaStreamWaitEvent(), etc. And I’m still considering that as a possibility, but it would mean executing the code symbolically, then writing a program that could parse the result and output the optimal CUDA stream management C code. That seems like an awful lot of work, would yield an awfully big (and probably ugly) host program, and would simply not be as ‘elegant’ as the original algorithm.

The “busy wait” method is also an option, as you’ve pointed out, but let’s face it - that would seriously undermine performance, compared to having the GPU ‘block’ while waiting.

Since I made the original post though, it’s occurred to me that I might be able to use the “vote” instruction for the “bar.arrive” functionality that I mentioned. That seems slightly more plausible than a “bar.red.op.pred” instruction with a thread count of zero.

However, I’m still looking for something that I may have missed about the whole CUDA synchronization thing. Maybe not. Maybe it simply isn’t possible to have a CUDA kernel ‘block’ its own execution until some condition is met. Dunno.

Currently there is no way to suspend execution of a running block, even though the PTX manual mentions in a few places that Nvidia reserves the right to implement this in the future.

If you want other work to be done while waiting for the variable to change, you have to explicitly do that other work in the loop that tests the variable.

Greg Diamos wrote something about the problems inherent in global synchronization (which is what this is) a year and a half ago:

http://forums.nvidia.com/index.php?showtopic=108604

Thanks for the link, tmurray.

Interesting stuff, no doubt. However, although I can’t claim to have a complete understanding of his proposal (it’s fairly cerebral), I was able to glean that his proposal would expand the CUDA programming model to facilitate kernel code that “would not have to go back to the host to launch the next kernel”. Very on-topic.

Unfortunately though, the catch is that it’s specifically targeted for “algorithms that can easily re-create their state after global synchronization”, and the algorithm I’m trying to implement doesn’t fall into that category. Darn. Missed it by that much…

I definitely appreciate the food for thought though. Any related articles are welcome…

Are you saying that the “bar.sync” and “bar.red” PTX instructions don’t actually work? The documentation (PTX ISA v2.1) explicitly states that these instructions “cause the executing thread to wait until all or a specified number of threads in the CTA arrive at the barrier before resuming execution”. They go on to state, “When a barrier completes, the waiting threads are restarted without delay”.

The aforementioned docs also mention that “At every instruction issue time, the SIMT unit selects a warp that is ready to execute”. This strongly implies that a running warp can be not “ready to execute”, i.e. be in a wait state, otherwise there would be nothing to “select” - warps in a CTA would simply be serialized.

But speaking of things to implement “in the future”, how about a new “suspend.until d” instruction that would suspend execution until “d” becomes non-zero? That sure would be nice, but for all I know, perhaps the “bar” family of instructions already exists to be both an efficient and flexible mechanism for achieving just that kind of functionality, and more. I don’t know. That’s what I’m trying to find out.

In fact, the bar.sync and bar.arrive instructions are already perfect for what I need, except for one major flaw - the bar.sync instruction signals “the arrival of the executing thread(s) at the named barrier”. Unfortunately, that doesn’t match my usage, because I’m defining the “barrier” as the point immediately after an executing thread has updated a single memory variable, not the point where it suspends execution until the barrier is reached.

In fact, I have no idea why the NVidia designers would even include that signal in the bar.sync instruction’s functionality. It seems to me that if you’re going to define a virtual barrier at all, you wouldn’t want the place where you wait until it’s been “arrived at” to also be considered an “arrival” at that barrier. It just doesn’t make sense to me (yet?).

They definitely do work, as they are the implementation of __syncthreads(). Warps in a block reaching a __syncthreads() call (i.e., the bar.sync instruction) will no longer execute instructions until all warps reach the same barrier. That is a very restricted sort of suspension, not directly suitable for the wait-on-condition-variable kind of suspension you are looking for. I imagine that one could busy-wait one thread in a while loop checking the condition variable just before the __syncthreads(). All the other threads in the block would reach the barrier and wait, while the one monitor thread loops.
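
A rough sketch of that idea (slightly modified to spin a whole warp rather than a single thread, since barriers operate per warp; the flag and all names are made up):

__global__ void wait_then_work( volatile unsigned *flag, unsigned target )
{
    if ( threadIdx.x < warpSize )
    {
        // the first warp busy-waits on the condition variable...
        while ( *flag < target )
            ;
    }
    __syncthreads();   // ...while every other warp parks here until the monitor warp arrives

    // once past the barrier, the whole block knows *flag >= target
    // ... do the real work here ...
}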

The primary use case for the barrier instruction is to manage race conditions between threads in a block when accessing shared memory. It is very normal in a CUDA kernel to need to scatter partial results between threads, and the standard way to do this is to have each thread write the partial result to shared memory, wait at a barrier for all threads to complete writing, then continue and allow threads to grab the results they need from shared memory. The barrier is essential to ensure that all writes have finished before reads begin.
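
In code, that standard pattern looks roughly like this (the block size and the partial computation are made up for the sketch):

#define BLOCK_SIZE 256   // assumed block size for this sketch

__global__ void share_partials( const float *in, float *out )
{
    __shared__ float partial[ BLOCK_SIZE ];
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;

    partial[ threadIdx.x ] = in[ idx ] * in[ idx ];   // each thread writes its partial result
    __syncthreads();                                  // all writes finish before any reads begin

    // now any thread may safely read any other thread's partial result
    out[ idx ] = partial[ ( threadIdx.x + 1 ) % BLOCK_SIZE ];
}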

An efficient barrier for waiting on a memory location state change is a much more sophisticated operation that sounds like it would require new hardware support. NVIDIA has been reluctant so far to provide a global barrier, beyond launching another kernel. I think this is because they want to encourage kernels to oversubscribe the device, and therefore automatically make full use of future devices with more SMs than the current generation. Global barriers (as mentioned in tmurray’s link) as typically specified conflict with this goal.

Actually, the bar.sync is far more flexible than the __syncthreads() function, because the bar.sync instruction allows for the specification of a thread count (even though that count has to be a multiple of the warp size), which the __syncthreads() function does not. You would think that the __syncthreads() function would be upgraded with an optional warp count parameter by now, but hey, I’m sure they’re busy with other things…

I don’t know from “directly suitable”, but is it exploitable for that purpose? Take a look:

As I mentioned, the only problem with the above is that the bar.sync instruction will signal an arrival at barrier 0, which it damn well should not (do I sound upset? - nah, just being emphatic).

I think I know what happened though. The bar.arrive instruction is a recent addition to the instruction set - introduced in the PTX ISA v2.0, according to the docs - so all they had before that was the bar.sync instruction. That being the case, it would make sense that the bar.sync instruction would signal an arrival at the barrier (I guess), because essentially, nothing else would or could.

But the arrival (excuse the pun) of the bar.arrive instruction changed all that. Now it just doesn’t make sense that the bar.sync instruction signals the “arrival” at a barrier, because no matter how you look at it, it’s not “arriving” at the barrier, it’s synchronizing the threads that have - there’s a difference!

Look at it this way: If you took out the signaling of “arrival” from the bar.sync instruction, you could still implement that functionality quite easily - simply precede the bar.sync with a bar.arrive. But the reverse is not true - i.e. you can’t remove the signal from the bar.sync instruction, even if you don’t want it - even if it will significantly impede performance. So it would be far more flexible to take it out. It would also make a whole lot more logical sense.

Right. And that is exactly how I’m trying to use it. The “partial results” you speak of are, in my case, the Ready state variable array (in shared memory, no less), but instead of waiting “for all threads to complete writing”, I’m using the “new” bar.sync thread count parameter to wait for “any thread to complete writing”. Not a huge difference, just a finer granularity (actually, it’s warp granularity, but that’s only because of a current hardware implementation limitation that may or may not disappear in the future).

If you’re referring to my earlier suggestion that NVidia add a “suspend.until” instruction, then yes, that would most definitely require new hardware support. But that suggestion was kind of tongue-in-cheek anyway; I just said that to try to make clear the end result I was going for. However, and at the risk of flogging a dead horse, let me be perfectly clear: it would definitely not require any new hardware support to take out the “arrival” signal from the bar.sync instruction. Less is more in this case.

Oooh - a conspiracy. What fun… But I have to admit, you kind of lost me on the rest of that… A global barrier would discourage kernels from oversubscribing the device? Y’okay, um… fire bad, tree pretty…

I was referring to the fact that it is not possible to suspend and restart a block (CTA in PTX-speak) in a kernel. bar.sync and bar.red synchronize warps within a block/CTA.

As you said you want to synchronize 4096 threads, which is larger than the maximum block/CTA size on all current devices, I was under the impression that you wanted inter-block synchronization, which is not currently possible (and would also be difficult to implement, and would encourage a programming style that does not scale well to future devices, as others have already pointed out).

Quite right, you did say “block” - I misunderstood. Being relatively new to this stuff, I’m used to thinking in terms of threads, warps, and CTAs - the book stuff; I haven’t quite mastered the common colloquialisms yet. So a block is synonymous with a CTA. Got it.

And as you’ve pointed out, I’m also guilty of glossing over the limitations of the CTA, in terms of what the current state of the tech defines as the maximum number of threads (or warps) that a block can have. I suppose I was operating under the false assumption that if I purchased the latest and greatest NVidia GPU board (a Tesla), with the largest possible number of “CUDA cores”, it could somehow accommodate 4,096 separate threads in a single block. After all, that’s only 128 separate warps. But no, it looks like there’s a hard limit of 1,024 threads per CTA, even on the latest Tesla boards (correct me if I’m wrong).

In fact, I have to confess that I’m still a little fuzzy on how terms like “CUDA core”, “Streaming Multiprocessor”, “Scalar Processor core”, “Thread processor”, and “GPU” all relate to each other, and to how many threads a CTA can have. The spec sheets for these boards seem to speak a different language than the programming manuals.

Is a “CUDA core” the same as a “Streaming Multiprocessor”, or is the latter what the spec sheets call a “GPU”?

The docs say that the “Streaming Multiprocessor” “implements a single-instruction barrier synchronization”, so it’s a pretty safe bet that it’s what is responsible for executing a block. And the docs also state that “The multiprocessor maps each thread to one scalar processor core”. So where on the spec sheet is the “# of scalar processor cores per Streaming Multiprocessor”? Well, it’s not there. Or maybe it is, but in some different language I have yet to fully comprehend. It’s all very confusing.

So anyway, as it turns out, I’m not too worried about it, because the particular algorithm that I’m working with now can be partitioned into smaller chunks if need be, but at the cost of additional processor time. So instead of 4,096 threads in one block, I could have 4 blocks of 1,024 threads each, but then the results of each block would have to be integrated into a final solution, at the cost of some additional processing time. Hopefully, that “additional processing time” will be a small fraction of the total, but I don’t know that yet.

What I am worried about is whether or not the core of the algorithm can work at all, regardless of how many threads it uses. And at this point, I think it will, but any opinions to the contrary are welcome.

As for whether or not this algorithm encourages “a programming style that does not scale well to future devices”, I just don’t see that. All I’m really trying to do is to successfully employ the processor’s own barrier mechanism to accomplish efficient data synchronization. That’s pretty much all there is to it.

Except, of course, for the fact that, as I’ve already mentioned, I think that the barrier mechanism itself is in need of minor repair…

A core is the basic scalar execution unit, a streaming multiprocessor (SM) is a SIMD-like array of cores, and a GPU is one or more SMs plus supporting fixed-function hardware, memory controllers, etc.

Appendix F in the current programming guide lists the “abstracted” hardware features of each of the 5 different CUDA-capable architectures or sub-architectures which have been released to date. The answer to your question is either 8, 32 or 48 cores per SM, depending on which hardware you are using.

True, although I haven’t figured out what the goal of that feature at the PTX level is. Things take a while to percolate up to the CUDA C level.

To be honest, I am having a hard time following this clever use of the barrier. It might work, it might not. All I’m saying is that you are ramming a square peg into a round hole, as far as CUDA usage goes. :)

Uh, no conspiracy. You say you are new to CUDA, so I’m providing a bit of history to give you some sense of how NVIDIA has been developing CUDA so far. NVIDIA doesn’t typically share future plans with the community, so all we have to go on in discerning their thinking is past behavior. Since about 5 minutes after the release of CUDA 0.x, people have been asking for a global barrier, and it still hasn’t manifested yet. (I realize that we’ve now figured out that you don’t in fact want a global barrier, but a modified barrier within the block.) We can only assume that it is either (a) really hard or (b) the people at NVIDIA in charge of guiding the evolution of the platform don’t want one for some reason. However, it’s probably not worth going off on a global barrier tangent since that doesn’t seem to be relevant now…

If you haven’t already, I would suggest you read the first half of the CUDA C Programming Guide. Terms like “block” are defined in the first two chapters, for example. The Programming Guide was the only documentation we had for a while, and PTX was originally designed for internal consumption. After enough prodding by people wanting to do lower level work (write compilers, translators, etc), NVIDIA published a proper PTX manual, but CUDA C is still the main way people interact with the device.

To facilitate a warp-level barrier, as opposed to just a block-level barrier? Does that not make sense? Is that not a good thing? Isn’t it better, from a purely performance perspective, to suspend only a single warp, as opposed to a whole block? In fact, I’m looking forward to the day when they can finally implement a true thread-level barrier, which I believe is where they’re headed, as that would match the documented “thread count” syntax of the instruction itself.

That I can certainly understand. Perhaps they’re waiting until their hardware people can implement a true thread-level barrier (I can only hope) before they add a new parameter…

Yes, that’s definitely a good way of phrasing what I sensed was your opinion on the matter. But I have to respectfully disagree with that assessment.

When I first encountered the bar.sync instruction, my initial impression was that it was referring to some kind of instruction barrier - i.e. once this instruction has been reached, you’ve arrived at a “barrier” - a place where you wait for all other threads to “catch up” to where you are, and then proceed.

I think this is how you still perceive this “barrier” concept. You see it as a “square peg”, as it were. But although it can be used as such, that just isn’t the whole story.

The story continues, and gets much more interesting, when you look at the bar.arrive instruction. At first, I didn’t understand how it could be used, because I had it stuck in my head that the bar.sync instruction was the barrier itself (the place to “wait” until all other threads “catch up”). But I racked my brain until it finally dawned on me that the barrier is not an instruction, it’s a purely virtual construct that doesn’t actually have to have a single location at all.

To make a long story short, a PTX program can have multiple bar.arrive instructions in it, all of which refer to the same “barrier”. That makes a “barrier” very much a user-definable concept. A barrier can signify a change in state, user input, an interrupt-level error, the point when a green pixel turns blue, or however else the programmer chooses to define it.

The bottom line: If this were not the case, there would simply be no need to have a bar.arrive instruction at all.

In other words, if the barrier mechanism was supposed to be exclusively used, as you’ve said, to “wait at a barrier for all threads to complete writing”, then you’d never need the bar.arrive instruction. It would be completely superfluous. You could implement your entire aforementioned functionality with just the bar.sync instruction, and be done with it.

But since when do processor manufacturers go out of their way to add useless instructions to their instruction set? They never do. I think the bar.arrive instruction was specifically added because NVidia knew that it would increase the utility of their barrier mechanism by several orders of magnitude.

But should their C compiler team be the only ones who know that? I certainly hope not.

I actually do appreciate your decision not to go off on that particular tangent. A global barrier would be nice, especially if it came with its own “arrive” instruction, but I can respect how hard that would be to implement in hardware. I would liken it to Microsoft being able to provide a function that could suspend the execution of several running DCOM objects. Not an immediate concern, at least for me.

Well, I did start with both that and the CUDA Reference Manual, but I guess my retention isn’t what it used to be - maybe I’m getting old. I read as much of both of them as I needed to get a good idea of how it all works, but at some point I realized I needed greater control over the GPU than the C interface could offer - specifically, at the time, the ability to efficiently handle arithmetic carry and borrow. Yes, I know there are C algorithms that can “accommodate” a carry-through, but speed is also very much an issue. So thanks to Google, I found PTX, and I haven’t looked back.

Thanks for the history lesson, BTW. I tend to think of this tech as history in the making anyway, so someday someone (you, maybe?) will write a book about how it all got started - they’ll call it “The CUDA Network” or something, and make it into a blockbuster movie…

No, I understand the low level concept, but I’m trying to think of an application of it. (Perhaps the example you have that I don’t quite follow is the first.) My usage of CUDA has been shaped by the path of least resistance in the language, so when I want to synchronize things, I want to synchronize the entire block. Usually it is because I’ve put some kind of table in shared memory that all threads are going to access randomly, so a block-wide barrier is the natural choice. Beyond that, I try to think in terms of no-barrier programming using atomic operations or data access patterns that are automatically race-condition free. If that doesn’t work, then I just don’t use CUDA. :)

Then perhaps a better way to phrase it is that you’ve shown up to stuff a square peg into a previously round hole that NVIDIA is still carving flat sides into. :) NVIDIA employees do read the forums, but seldom respond to threads like this if it involves a future feature they are going to announce. (Unless it is imminent, then tmurray will spill the beans sometimes.)

Fair enough, but the PTX manual is a brutal way to learn CUDA. I should note that CUDA 4.0 (release candidate available to registered developers, soon to all developers) will officially support inline PTX in CUDA C code. So you will be able to mix and match as well.
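
For example, something along these lines (the exact syntax in the final 4.0 release may differ slightly):

// read the %laneid special register via inline PTX
__device__ unsigned lane_id( void )
{
    unsigned id;
    asm volatile( "mov.u32 %0, %%laneid;" : "=r"( id ) );
    return id;
}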

It is probably also worth playing the devil’s advocate for a moment and pointing out that PTX is a virtual machine assembly language. Not everything defined in the reference is either emitted by the compiler or assembled into an analogous instruction by the assembler. So even though bar.arrive exists in the virtual machine language syntax, it doesn’t mean it really exists in current hardware and software implementations of CUDA or is actually used for anything. bar.sync is certainly emitted by the compiler, but I can’t ever recall seeing bar.arrive.

Okay, I guess I can live with that. It will become a perfect fit when NVidia removes the “arrival” signal from the bar.sync instruction. Then everyone can use it, even rebels like me.

Thing is though, I didn’t set out to pervert anyone’s idea of how the GPU is supposed to work. My task is actually quite daunting, so I have to squeeze every bit of juice I can out of any tech I can find. I have to reduce a serial process that currently takes one core of a dual-core 2.4 GHz Intel processor one full month to execute, down to exactly 1/100th of a second. Then I have to run that for a full year, non-stop. That’s going to take some outside-the-box thinking, not to mention about 8 separate Tesla boards, probably a 6-core Intel processor, and a whole suite of the latest and greatest software development tools.

So it’s fairly important to me that I not only have all the software written to do that before I shell out the big bucks for such a system, but also to be at least 95% sure that it will actually work on the hardware. Suffice it to say that I’m not there yet, on either score.

Actually, I find assembly about as easy to take as C - I’ve worked with both for such a long time now, that which one I use depends solely on what I need to accomplish. And yes, I’m definitely looking forward to seeing the inline assembly feature implemented in a publicly available development tool. That really will be the best of both worlds.

P.S. I guess I really am a rebel - I’m seriously considering using D, not C, for the host code… But that’s another discussion for another time (and thread)…

Oh, great, throw gasoline on the fire.

So now I’m even less certain that my code will work. Oh well, c’est la vie.

I don’t buy that bar.arrive isn’t “actually used for anything”, if that’s part of your implication. I don’t think NVidia would add a fake instruction to their instruction set. If that were the case, wouldn’t it make more sense for them just to have made it a directive?

Anyway, interesting sidebar: the bar.arrive instruction has a mandatory (not optional) “thread count” parameter, but that parameter is never fully explained, except to say that it can’t be zero. It would seem to imply that the arrival ‘fades out’ once a certain number of threads have synchronized to it, but that’s just a guess on my part. It would appear to be more than a little superfluous.

Speaking of which, my previously posted code does have an error in that very parameter of the bar.arrive instruction - I’m giving it a thread count of “WARP_SZ”, but it should actually be the “special register”, %ntid.x, which in this case, would be the total number of threads in the CTA. That’s assuming, of course, that I’m using a 1D CTA, and the total number of threads in it is some multiple of the warp size…

D is definitely a new one, though it should be quite possible using the Driver API. Quite a few people here call CUDA from Python, C#, FORTRAN and Matlab, so C is not the only game in town. :)

Some simple inside-the-box thinking tells me that to achieve this you would need a speedup of 30 × 24 × 60 × 60 × 100 = 0.2592×10^9 versus the CPU (one month is roughly 2.592×10^6 seconds, and the target is 10^-2 seconds). Assuming the best case of a compute-bound integer or single precision task that does not use SIMD instructions on the CPU but fits ideally to GPU computing, a single Tesla board is about 215× faster than one core of a dual-core 2.4 GHz Intel processor. Assuming ideal partitioning over 8 Teslas can be achieved, they thus provide a speedup of 1720×, which still leaves a gap of five orders of magnitude. So the CPU implementation of your process needs to be a hell of an inefficient mess for your target to even be theoretically achievable.

Not that I want to discourage you, but there seem to be some myths about GPU computing that sometimes need a bit of debunking.

Now to get back to the issue of synchronization and what bar.arrive may be useful for: I can certainly imagine some kind of producer-consumer scheme inside a block, where a given number of warps are dedicated to producing data which the others consume. That would perfectly match the existing combo of bar.sync and bar.arrive, where, compared to just the former, the newly introduced latter instruction allows the producer warps to go ahead without waiting for the others.
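
Purely as a sketch of that idea (using inline PTX as mentioned above, with made-up warp counts, and assuming a 256-thread block), it might look something like this:

#define PRODUCER_WARPS 2
#define CONSUMER_WARPS 6
#define WARP_SIZE      32

__global__ void producer_consumer( const float *in, float *out )
{
    __shared__ float buffer[ PRODUCER_WARPS * WARP_SIZE ];
    const unsigned nthreads = ( PRODUCER_WARPS + CONSUMER_WARPS ) * WARP_SIZE;  // must equal blockDim.x
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;

    if ( threadIdx.x < PRODUCER_WARPS * WARP_SIZE )
    {
        // producer warps: fill the buffer, flag arrival at barrier 1, then carry on
        buffer[ threadIdx.x ] = in[ idx ];
        __threadfence_block();                                    // be conservative about visibility
        asm volatile( "bar.arrive 1, %0;" :: "r"( nthreads ) );
        // ... producers continue with other work here, without waiting ...
    }
    else
    {
        // consumer warps: wait at barrier 1 until all participating threads have arrived
        asm volatile( "bar.sync 1, %0;" :: "r"( nthreads ) );
        out[ idx ] = buffer[ threadIdx.x % ( PRODUCER_WARPS * WARP_SIZE ) ];
    }
}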

I could also imagine combining these with atomic operations in shared memory to provide the functionality you asked for - something along the lines of the sketch below - but that would certainly get in the way of your overall aim of maximal efficiency. Maybe you can sketch out in a bit more detail what you want to achieve, and with some luck we might be able to suggest a more natural way of doing that with CUDA.
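
Just to make the atomics variant concrete, here is a very rough, block-level-only sketch (names and sizes invented; it steps all threads through rounds with __syncthreads() instead of the finer-grained barriers, and assumes blockDim.x is a multiple of N_VARS on hardware that supports shared-memory atomics):

#define N_VARS 32

__global__ void calc_kernel( void )
{
    __shared__ unsigned Ready[ N_VARS ];

    if ( threadIdx.x < N_VARS )
        Ready[ threadIdx.x ] = 0;
    __syncthreads();

    const unsigned n = threadIdx.x % N_VARS;   // which state variable this thread depends on
    const unsigned m = threadIdx.x / N_VARS;   // how many prior updates it needs to see
    bool done = false;

    for ( unsigned round = 0; round <= blockDim.x / N_VARS; ++round )
    {
        if ( !done && Ready[ n ] >= m )        // dependency satisfied this round?
        {
            // Do_Some_Stuff( n, m ) would go here...
            atomicAdd( &Ready[ n ], 1 );       // ++Ready[ n ], made atomic
            done = true;
        }
        __syncthreads();                       // make this round's increments visible to everyone
    }
}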