CUDA Kernel self-suspension? Can a CUDA Kernel conditionally suspend its execution?

Thanks for taking the time to explain this in more detail. I see your point: it is cumbersome to always include the arrive in the sync. I think you could probably avoid the problem of requiring all consumers to wait for all other consumers by using multiple barriers, one per consumer, but that is much more complex (and probably slower) than it needs to be.

However, implementing signaling mechanisms like this in hardware is actually very hard, and there are a few subtle issues that make what you are asking for difficult to fix. Specifically, there is the problem of deciding when to reset the barrier.

Let’s say that I want one producer to signal multiple consumers. The consumers wait on a barrier, and the producer eventually signals the barrier to release them. This is fine, but what about the case where the producer arrives first? Should 1) the signal release only the consumers that have already arrived? Or 2) should the barrier remember the signal until a certain number of consumers have blocked?

I think that the first case requires programmers to either ensure that this never happens (which means ensuring that all consumers wait for the producer), or issue the signal multiple times until all consumers have finished, which seems even more complex than the current implementation. The second case is what the current implementation of bar.sync/bar.arrive tries to enable.
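To make the second case concrete, here is a small host-side sketch of a barrier whose signal is “sticky”: the producer’s arrival is remembered, so consumers that block afterwards are still released. The name `StickyBarrier` and the whole structure are hypothetical; this is a software analogy for the semantics, not the hardware mechanism.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// "Case 2" as a host-side analogy: the barrier remembers the producer's
// signal, so consumers arriving after the producer are still released.
class StickyBarrier {
    std::mutex              m;
    std::condition_variable cv;
    unsigned                signals = 0;  // arrivals are remembered here
public:
    void arrive() {                       // producer side (bar.arrive analogue)
        { std::lock_guard<std::mutex> lk(m); ++signals; }
        cv.notify_all();
    }
    void sync(unsigned required) {        // consumer side (bar.sync analogue)
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&]{ return signals >= required; });
    }
};

// Demo: the producer signals BEFORE any consumer waits; because the signal
// is remembered, both consumers still pass. Returns the release count.
int demo_producer_first() {
    StickyBarrier b;
    std::atomic<int> released{0};
    b.arrive();                           // producer arrives first..
    std::thread c1([&]{ b.sync(1); ++released; });
    std::thread c2([&]{ b.sync(1); ++released; });
    c1.join();
    c2.join();                            // ..yet both consumers get through
    return released.load();
}
```

Under the first case, by contrast, the `arrive()` before any `sync()` would be lost, and both consumers would hang.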

Convincing someone to change this would probably require a solution to this problem that is relatively easy to implement in hardware. The best solution that I can come up with is separating the arrival count into blocking and non-blocking sets (so consumers wait until the blocking set is incremented and the barrier is reset when the non-blocking set is incremented), but this seems complicated and unwieldy as it would require even more new instructions. If you or anyone else has a better solution, I think that people who could actually change this would be willing to listen.
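One possible shape for that split-counter idea, again as a rough host-side sketch: waiters are released by increments to a blocking count, while a separate non-blocking count decides when the barrier resets. Every name here is invented, and the reset rule is my guess at what the hardware would have to track.

```cpp
// Hypothetical sketch of the "blocking / non-blocking arrival sets" idea.
// Nothing here reflects real hardware; it only illustrates the bookkeeping.
struct SplitBarrier {
    unsigned blocking     = 0;  // consumers wait until this reaches their threshold
    unsigned non_blocking = 0;  // counted separately; triggers the reset

    void arrive_blocking() { ++blocking; }                   // releases waiters
    bool released(unsigned need) const { return blocking >= need; }

    void arrive_non_blocking(unsigned participants) {        // counts completions
        if (++non_blocking == participants) {                // last one resets
            blocking = non_blocking = 0;
        }
    }
};
```

The point of the split is that releasing waiters and deciding when to reset become two independent counts, at the cost of the extra instructions mentioned above.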

Also, I think that another reason for having a thread count in bar.sync was to allow some threads to be neither producers nor consumers; whether or not this is useful is debatable.

Sorry to intervene, but one of my pet projects, cuqu, does just this. It provides a templated CUDA queue, where host and device can exchange ‘data structures’. Basically a user-friendly wrapper around a chunk of pinned memory.

davide

I remember tmurray saying “Don’t attempt such things… Bad things lie down that path…” when somebody was trying to implement CPU-GPU memory sharing… Check this URL - http://forums.nvidia.com/index.php?showtopic=190368

But that has not stopped people from experimenting with it… I have seen at least one GTC poster co-authored by an NVIDIA employee on this…
But Tim never replied when I brought this to his notice…

So, whatever you are trying, just keep this in mind… If a failure looks odd, you may want to rethink your foundation.

Thanks for the link, I’ve been rolling my own versions of this, but it would be nice to have a library…

The belated disposition of this response is brought to you by the people who actually pay me to program their computer systems, most of whom, strangely enough, want their software written in a timely manner. However, now that I have some available time again, I thought I’d get back to this discussion, because it has been fairly interesting.

Greg, you’ve made an excellent point, and one which, I have to admit, didn’t really occur to me until I read your post. I was thinking of the whole problem from a strictly end-user perspective. Not usually a bad thing, in and of itself, but in this particular case, as you’ve pointed out, there may be some non-trivial implementation issues involved.

So I thought about it, and then I thought about it some more. I’m not a hardware person, I’m a software person. So I obviously can’t explain how it can work in hardware. What I can do is attempt to explain how I see it working, both internally and externally, from a purely software perspective.

To begin with, and to make a long story short, I see the parameters of the problem as this:

  1. The syntax of both the bar.arrive and bar.sync instructions should not change.

  2. The bar.arrive instruction should issue an “arrival” signal to exactly as many bar.sync instructions as are specified in its thread count parameter.

  3. The bar.sync instruction should block until it has received as many “arrival” signals as are specified in its thread count parameter.

That would make the system logical from an end-user perspective, far more efficient (as previously discussed), and at least conceptually realizable.

The obvious consequences of the above would include the following:

  1. the bar.sync instruction would not issue an “arrival” signal, and

  2. the bar.arrive instruction’s thread count parameter would have a clearly understandable and documented meaning.

Before I get into how I see the above arrangement working internally, however, I should first point out that certain unavoidable assumptions had to be made. Primary among them is that the CUDA system already stores, or at least keeps track of, a “state” for each of a block’s (1024) threads and each of a block’s (16) barriers.

In actual practice, the CUDA system probably maintains a “state” for each warp, either in addition to, or instead of, a “state” for each individual thread, but to keep things simple, I’m assuming a thread-based granularity only. The methodology should be readily extensible to the former representation regardless.

In order for this to work therefore, the “thread state” would have to include this:

class CUDA_Thread	// <== Single Thread (partial) State Info..
{
    // ...
    volatile bool blocked;	// <== True if this thread is currently blocked..
    unsigned char BarrID;	// <== Barrier ID (zero-based ordinal) of the blocking barrier..
    unsigned int required;	// <== how many arrival signals are required to unblock..
    // ...
};

And the “barrier state” would have to include this:

class CUDA_Barrier	// <== Single Barrier State Info..
{
    CUDA_Barrier()  { arrivals = tcnt = 0; }	// <== ..constructor..

    volatile unsigned arrivals;		// <== how many 'arrivals' have occurred..
    volatile signed int tcnt;		// <== count of participating threads..

    void bar_arrive( unsigned threadcnt );		// <== bar.arrive functionality..
    void bar_sync( unsigned tID, unsigned threadcnt );	// <== bar.sync functionality..
};

With that out of the way, the state variables (of a single block) could then be defined as:

.internal CUDA_Barrier Barrier[ 16 ];	// <== ..State Info for all barriers in a CUDA block..

.internal CUDA_Thread Thread[ 1024 ];	// <== ..State Info for all threads in a CUDA block..

As you can see, I’ve invented a new keyword, “.internal” - it basically means what(ever) you think it means…

I’ve also marked certain variables as “volatile”. This is actually a shorthand way of saying that these variables, in the actual implementation, would have to be “processor locked” before they could be read from or written to, as their contents are very much subject to multi-threaded access. Again, to keep things simple, I’ve excluded any “processor locking” code.
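In actual C++ that “processor locking” would be expressed with atomics rather than `volatile`. A minimal illustration, with invented names, of what safe multi-threaded access to one of those counters would look like:

```cpp
#include <atomic>

// Illustration only: std::atomic gives the "processor locked" read-modify-write
// semantics that the volatile markers in the model are standing in for.
struct BarrierState {
    std::atomic<unsigned> arrivals{0};  // safe to increment from many threads
};

unsigned post_arrival(BarrierState& b) {
    // atomic increment: no two threads can ever lose an update
    return b.arrivals.fetch_add(1) + 1;  // returns the new arrival count
}
```

The model code below keeps the `volatile` shorthand for readability, on the understanding that a real implementation would use something like the above.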

So, on the premise that “people who could actually change this would be willing to listen”, the following is how I see such a system working, explained in the language of C++:

/***********************************************************************************\
* utility function - re-evaluates all threads' blocked status for a single barrier: *
\***********************************************************************************/

void ChkAllThreads( unsigned char BarrID, unsigned arrivals )
{
    unsigned tID = 0;	// <== thread ordinal ("ID")..

    do
    {    if ( Thread[ tID ].blocked && Thread[ tID ].BarrID == BarrID )
              Thread[ tID ].blocked = arrivals < Thread[ tID ].required;	// <== stays blocked until enough arrivals..
    }
    while ( ++tID < NumThreads );
}

/**********************************\
* bar.arrive internal calculation: *
\**********************************/

void CUDA_Barrier::bar_arrive( unsigned threadcnt )
{
    ++arrivals;

    if ( (signed int) threadcnt > tcnt )	// <== the maximum bar.arrive thread count determines
        tcnt = threadcnt;			//     how many barrier synchronizations are required
						//     to happen before a reset of this barrier..
    ChkAllThreads( this - Barrier, arrivals );
}

/********************************\
* bar.sync internal calculation: *
\********************************/

void CUDA_Barrier::bar_sync( unsigned tID, unsigned threadcnt )
{
    CUDA_Thread *thisThread = &Thread[ tID ];

    if ( ( thisThread->blocked = arrivals < threadcnt ) )	// <== block until enough arrivals have been seen..
    {
         thisThread->BarrID = this - Barrier;
         thisThread->required = threadcnt;
    }

    while ( thisThread->blocked ) Sleep( 0 );	// <== ..or some hardware method..

    if ( !--tcnt ) arrivals = 0;		// <== Barrier gets reset HERE..
}

I’m pretty sure that the above is more than you, or anybody else reading this thread, expected to see, so I’m going to hold off on any diatribe pertaining to why I think it would work until I get some feedback on what you (or any other interested party) thinks about it. Especially in light of the fact that I haven’t exactly tested this code out in a simulated CUDA environment…

Or perhaps this is more of a starting point for further exploration. Either way, here are some rather cryptic examples of the above ‘working’:

bar.arrive 0, 2   @ time T     tcnt = 2, arrivals = 1
bar.sync   0, 1   @ time T+1   NOT blocked, tcnt = 1, arrivals = 1
bar.sync   0, 1   @ time T+2   NOT blocked, tcnt = 0, >> barrier reset <<

OR

bar.sync   0, 1   @ time T     blocked, arrivals = 0
bar.arrive 0, 2   @ time T+1   tcnt = 1, arrivals = 1
bar.sync   0, 2   @ time T+2   blocked, arrivals = 1
bar.arrive 0, 1   @ time T+3   tcnt = 0, >> barrier reset <<

OR

bar.sync   0, 1   @ time T     blocked, arrivals = 0
bar.sync   0, 2   @ time T+1   blocked, arrivals = 0
bar.arrive 0, 2   @ time T+2   tcnt = 0, >> barrier reset <<
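Since the model hasn’t been run in a simulated CUDA environment, here is a single-threaded re-expression of it in which the first two sequences above can be replayed mechanically. Queuing “blocked” syncs is my substitution for the Sleep loop, and the blocking test is read as “blocked while arrivals < threadcnt”; the `SimBarrier` name and structure are mine, not part of the model.

```cpp
#include <vector>

// Single-threaded replay of the barrier model: blocked bar.syncs are queued
// instead of spinning, so one thread can drive all the events in order.
struct SimBarrier {
    unsigned arrivals = 0;          // how many 'arrivals' have occurred
    int      tcnt     = 0;          // count of participating threads
    std::vector<unsigned> waiting;  // threadcnt of each blocked bar.sync

    void complete_one() {           // a completing sync decrements tcnt;
        if (--tcnt == 0) arrivals = 0;  // the last completion resets the barrier
    }
    void drain() {                  // release queued syncs whose threshold is met
        for (auto it = waiting.begin(); it != waiting.end(); ) {
            if (arrivals >= *it) { it = waiting.erase(it); complete_one(); }
            else ++it;
        }
    }
    void bar_arrive(unsigned threadcnt) {
        ++arrivals;
        if ((int)threadcnt > tcnt) tcnt = threadcnt;
        drain();
    }
    // returns true if the sync passed immediately, false if it blocked
    bool bar_sync(unsigned threadcnt) {
        if (arrivals < threadcnt) { waiting.push_back(threadcnt); return false; }
        complete_one();
        return true;
    }
};
```

Stepping this through the first sequence gives exactly the annotations shown: arrive(2) leaves tcnt = 2 and arrivals = 1, the first sync(1) passes and drops tcnt to 1, and the second sync(1) passes and resets the barrier.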

Hello again Sylvain. I have to admit that when you first brought up the subject of “monotonic counters”, I thought you were referring to the use of my application-specific “Ready” state variable array. But no, it would appear obvious at this point that you were referring to how a ‘fixed’ barrier mechanism could possibly work internally.

My bad.

So, to answer your question, please see my response to Gregory Diamos’ excellent post about hardware implementation issues inherent in a ‘fixed’ barrier mechanism. In my response to his post, I attempt to explain, in the language of C++, how a ‘fixed’ barrier mechanism might work from both internal and external perspectives.

I would be very interested to know your opinion on the matter, especially in light of the fact that you are the author of a Tesla simulator. Perhaps I could interest you in spending weeks and weeks of your time testing out my little theory on your software. Hmmm… Then again…

Does my solution use a kind of “monotonic counter”? Well, I don’t really know. I do know that it uses “counters” of a sort, but one of the more important of those is signed - i.e. it can assume negative values. It tracks the “participating threads”, and is signed precisely because it needs to accommodate the stochastic nature of synchronization instruction occurrences. It’s quite possible that it could be classified as a particular brand of “monotonic counter”, but I don’t know.

Excellent point. I stand corrected. So now all we have to do is convince NVIDIA to make the condition code an accessible register.

Well, that really blows my mind. It’s one thing to “abstract the details” in a “symbolic assembler”, but it’s quite another to prohibit the capabilities of the machine itself. That really irks the hell out of me.

So I guess the only thing left to say at this point is that NVIDIA should get its act together and provide its bread-and-butter developer base with a more capable and intelligent software ISA. End-Users Unite!!

Okay, okay, I get it. It’s just that programming in “assembler” has always been done by programmers for the sole purpose of increasing their control over the machine. Why else would someone program in assembly? From what you’re saying, one could effectively argue the point that programming in the PTX ISA is pointless, as it offers no tangible advantage over programming in C or C++… That, to put it mildly, is a major disappointment…

However, I take comfort in the knowledge that this is indeed a new and emerging technology, and there are bound to be growing pains associated with any new technology. So I’m willing to wait and see if NVIDIA can rise to the challenge(s)…

I think “no tangible advantage” is not quite true. PTX provides a reasonable target for writing compilers, and definitely gives more control than CUDA C or C++. It isn’t the raw machine code of the card, but there is a tradeoff in committing publicly to an ISA for a very unusual architecture that is still evolving rapidly. NVIDIA can alter the capabilities of CUDA hardware much more rapidly than someone like Intel, whose real ISA is their public interface. Of course, this hinders people with the skill and need to hyper-optimize to the hardware.

Along these lines, the fact that NVIDIA is actually shipping cuobjdump with their toolkit now shows definite progress. For a long time, our only insight into the actual hardware instruction set of the card (and the machine code after PTX-level optimization) was decuda. That tool required some epic reverse engineering on the part of Wladimir van der Laan, and was valuable to a lot of people trying to understand the behavior of the hardware.

There’s probably also a bit of a culture shift to overcome. Unlike the CPU world, GPU manufacturers have not needed to release such low-level information on their chips until GPGPU took off. Instead, they only had to expose interfaces to high level languages like GLSL. I expect we’ll get to a more open state in the future, but it will take a little while…