CUDA kernel self-suspension? Can a CUDA kernel conditionally suspend its execution?

One point that may be worth mentioning here is that you can accomplish many of the same tasks that you would use a global barrier for with a non-preempting asynchronous task. This may or may not make sense for your application, and CUDA definitely does not provide any language features for doing this, but it essentially allows you to say something like:

“wait until these threads finish, then launch those other threads”

The difference between what you are asking for and this is that the first set of threads is run to completion before the second set is started. These are essentially tasks in nesC (the TinyOS programming language), and they let you express a surprising number of interesting applications without requiring a complete preemptive scheduler, while bounding the state and processor requirements of correctly executing a program.

The key that makes this scalable is that it only allows you to express forward data-dependencies between tasks/kernels.

If you want to do this in CUDA, you can set up a shared buffer in host memory, set up a virtual function table in device memory, and write a device function that sends a packet to the host asking it to launch a new kernel with a specified function and set of arguments. You can build complex applications like this using continuations; the main disadvantage is that your device state gets torn down when you start up a new kernel. However, like any application that uses a mechanism like this (tail-recursion), you can always manually set up your persistent data in global memory. You do pay for a round-trip to the host and the cost of completely tearing down and starting up a kernel.
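For concreteness, here is a minimal sketch of that relaunch pattern, with the persistent data kept in memory visible to both sides (managed memory is used purely to keep the sketch short); the State struct, its "next" field, and the stageA/stageB kernels are made up for illustration and are not part of any CUDA API:

#include <cuda_runtime.h>
#include <cstdio>

struct State {
    int next;    // which continuation the device wants launched next (0 = stop)
    int value;   // persistent data that survives across kernel launches
};

__global__ void stageA(State *s) { s->value += 1;  s->next = 2; }
__global__ void stageB(State *s) { s->value *= 10; s->next = 0; }

int main()
{
    State *s;
    cudaMallocManaged(&s, sizeof(State));
    s->value = 0;
    s->next  = 1;

    // The host acts as the scheduler: it waits for one kernel to finish,
    // reads which continuation that kernel requested, and launches it -
    // paying a full round-trip plus kernel teardown/startup each time.
    while (s->next != 0) {
        if (s->next == 1) stageA<<<1, 1>>>(s);
        else              stageB<<<1, 1>>>(s);
        cudaDeviceSynchronize();
    }
    printf("final value = %d\n", s->value);   // prints 10
    cudaFree(s);
    return 0;
}

The cudaDeviceSynchronize() inside the loop is exactly the round-trip and teardown cost mentioned above.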

Also, if someone with some compiler background wanted to take this further, it would be possible to do real preemption as a PTX pass, plus a modification to the runtime to implement a ‘kernel-scheduler’, in Ocelot. We’ve pegged it at about 6-9 man-months of work by someone who knows what they are doing, and we are not actively working on it.

Attempting to achieve inter-block synchronization is the last thing one should want to do with CUDA…

I appreciate your input. Your post has forced me to do some number-crunching myself, and frankly, the results are not as good as I thought they would be. They are, however, significantly better than what your analysis would indicate, so I remain open to the idea that what I’m after is (barely?) feasible.

To begin with, I don’t know where you get your 215× figure - i.e. whether you’re basing it on the ratio of operations per second between the two processors, on some assumed application of Amdahl’s law, or perhaps some combination of the two - but my thinking has always been somewhat simpler, at least at this stage in the game.

The core of the problem (excuse the pun) is the multiplication of large numbers a large number of times. In the current serial implementation, the program multiplies two 315,653 decimal-digit integers 8,386,560 times (with different integers each time, of course). It takes a long time because, to put it simply, it performs each multiplication serially - it doesn’t start the next multiplication until the previous multiplication has finished.

The parallelized version, on the other hand, can perform each multiplication in parallel. Not incidentally, it does that part of the algorithm without the need for synchronization. If it takes the one core of the dual-core Intel processor “T” seconds to compute each multiplication, and assuming that each “scalar processor core” of the GPU is at least as fast at executing each multiplication, then 1,024 “scalar processor core” threads (one CTA) should get approximately 1,024 multiplications done in time “T”.

But I need to perform 8,386,560 multiplications - so I need to have 8,190 of these blocks running concurrently (8,386,560 / 1,024 = 8,190) if I want all of the multiplications done in time “T”.

To be honest, this is where things get a little fuzzy. avidday posted a response to my inquiries about how many blocks can be run on a single GPU board, and I was in the middle of posting a rather lengthy reply when Windows Internet Explorer completely clobbered it, and I really didn’t feel like re-creating it. Basically, I’m still trying to grapple with how to read the spec sheets for these boards - they all seem to specify different metrics, depending on the board (e.g. “Fermi” has an “SM count”, but “Tesla” doesn’t).

However, as near as I can tell, a board can have up to 16 “SM” units on it, which translates into 16 concurrent CTAs per board (I think). At that rate, I’d need 512 boards to get the job done in time “T”.

That’s not good. It is, however, far better than your “gap of five orders of magnitude”. If I use 8 boards, the gap would be less than two orders of magnitude. Still considerable, though.

However, as you’ve probably already noticed, all these calculations are based on getting the job done in time “T”, which is the amount of time it takes the one core of the dual-core processor to compute one multiplication. The question is, is that amount of time less than 1/100th of a second?

I don’t know that it isn’t. One hundredth of a second theoretically represents 24 million clock cycles on a 2.4 GHz processor, ignoring pre-emptive interruptions. That represents a lot of instruction executions.

And since I’m using serial recursive Karatsuba to compute each multiplication, it’s quite possible that each multiplication will, in fact, take less than 1/100th of a second, especially if I optimize it in PTX. But I don’t know that yet. I’ll have to sit down and do some serious profiling…

BTW, I’ve already considered parallelizing the Karatsuba, but it turns out that doing so requires massive amounts of memory (terabytes actually), so I’m discounting that for the moment.

I can think of some other optimizations as well, but they’re all dependent on how fast each multiplication can be executed on the GPU. For example, I’m not married to the size of each multiplicand - they can be reduced or expanded as need be - i.e. if I double the size of each multiplicand, I’ll only need half as many cores.

The host is also another resource that I’ve completely ignored in these calculations. Its main use in the current algorithm is to bind (a.k.a. add up) all the individual multiplications together into a cohesive whole (concurrently, of course), but there may be some leftover processing power there that can be exploited. If I use a 6-core machine, that leftover may be considerable.

Also, the “one full month” period is actually only an approximation, as is the idea that it had exclusive use of “one core” of the dual-core processor. My laptop is, in fact, a 2.4 GHz dual-core machine, and it did take about a month - give or take a week - for the software to complete, but I also did other things on my laptop all the time - it wasn’t a machine dedicated to this task. And although the software used a single Windows thread to do all of its work, that doesn’t mean that Windows didn’t occasionally commandeer both cores for whatever it deemed necessary at the time.

Actually though, my confidence level in this project has taken a bit of a hit as a result of looking at these numbers. But it’s definitely not the first time that that’s happened. I’ve gone through at least five major paradigm shifts in this one project, and each one significantly improved its chances of success. The last one would have taken three years to run, and would have required 8,192 peer-to-peer machines hooked up to 100 Gigabit Ethernet. Hardly realistic, but a necessary precursor to the current effort.

So I can’t help but think that I’m just a couple of tweaks away from a fully-realizable plan…

Thanks very much for that link - it looks really interesting. I’ll have to read it later though.

Yes, except that in my current algorithm, each thread is both a producer and a consumer, but only after each multiplication has been achieved.

And yes, the synchronization does get in the way of maximal efficiency, but the synchronization itself is designed to be minimal. There has to be some synchronization, because this application is not an embarrassingly parallel problem…

Don’t have time to read and reply to your post at the moment. But one quick comment, as your main problem seems to be multiplication of large integers: have you looked into using FFT methods? These might easily provide the orders-of-magnitude speedups you are looking for. GMPlib is a good ready-to-use library.

BTW, the 215× factor comes from the ratio of ALUs and frequencies. Without SIMD, a modern x86 core does 2 arithmetic operations per cycle (one mul, one add), resulting in 2 × 2.4×10⁹ = 4.8×10⁹ operations per second. Incidentally, a CUDA core does 2 arithmetic operations per cycle as well (one multiply-add). A Tesla C2050 or C2070 has 448 cores clocked at 1.15 GHz, for a total of 448 × 2 × 1.15×10⁹ = 1030.4×10⁹ operations per second. The ratio is the speedup factor I used.

Hi Greg.

Maybe I missed something in the CUDA Device API (actually, to be fair, I probably missed a lot), but how exactly does a running CUDA kernel, or a called function thereof, “send a packet to the host” ?

Or does your idea include setting up a table of host function addresses that one or more running CUDA kernels can use to make indirect calls into the host? I would not have thought that possible, because I’ve never heard of a GPU invoking a host function - it’s always the other way around.

Yes, I’ve looked into FFT, NTT, Schönhage–Strassen, Toom-Cook, and GMPlib - three paradigms ago. The GMPlib source code in particular was a disappointment - it has glaring inefficiencies in it, and an unacceptable number of holes. Their “mul_fft.c” module, for example, which is actually their implementation of the Schönhage–Strassen algorithm, contains no fewer than 10 separate “FIXME:” comments. It also assumes an ability to allocate memory on the fly, how much of which is unclear, because both the algorithm itself and their implementation of it are more than a little complex. I spent a considerable amount of time attempting to rewrite it, but in the end the Karatsuba method won out over all of these methods, because it is both fast and simple, has far better memory locality, has low and predictable memory usage, and is unambiguously demonstrable. That last point is especially important, because a requirement of this project is that it has to be ironclad - the validity of the result cannot be ambiguous or contentious in any way, shape, or form. It has to be inarguably defensible, and code riddled with “FIXME:” comments is hardly that.

However, I’ve recently been made aware that NVIDIA has a CUDA-based FFT library out there (CUFFT), so I may revisit the FFT method. Using FFT for large integer multiplication does have the drawback that an inherent floating-point precision error has to be managed intelligently, which may introduce some defensibility issues.

All of which is way more than you wanted to know, I’m sure…

I don’t suppose you could tell me where I might find a reference or a link to the “448 cores” of the Tesla C2070 that you mentioned? What does that number represent in terms of concurrent threads or Streaming Multiprocessors?

Related to that, I just now read in section 3.1 of the PTX ISA 2.1 manual that “A multiprocessor can execute as many as eight thread blocks concurrently.” That means that a GPU board with 16 SM units can execute a maximum of 128 (16 × 8) concurrent thread blocks. So if I have 8 GPU boards, with 16 SM units each, the “gap” that we were talking about before is now less than one order of magnitude (×8)! Better and better…


No, it was very interesting, as it gives an idea of what we are talking about.

Just look at one of the Tesla pages on the Nvidia website.

The two important metrics for the throughput of compute-bound kernels are the number of cores and the frequency of the cores, as each core performs at most two integer or floating point operations per cycle, no matter how they are organized into threads and blocks.

For memory-bandwidth-limited kernels, on the other hand, the one relevant metric is of course the memory bandwidth. I did not cite that, as I started from the most optimistic assumptions, and the memory bandwidth of a GPU is only one order of magnitude larger than that of a CPU.

The threads of these blocks will however still have to wait for cores to be available to actually perform their calculations, so a larger number of threads in flight does not increase the limit given above.

Hi Billy,

The basic approach is to set up some zero-copy memory that is shared between the device and the host and then manually create an asynchronous message queue (head/tail pointers, data, etc.) in it. You would use Lamport’s queuing operations to access it. There were a few research papers from John Owens [1] that used it to implement MPI on CUDA, but you can use it for pretty much any host-device communication. Once you have this set up, you can write a library on top of it that allows you to call something along the lines of queue.push(message) from a CUDA thread, and message = queue.pull() from a host thread.
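Here is a minimal sketch of that approach, assuming a single device-side producer and a single host-side consumer; the MsgQueue layout, its capacity, and the function names are illustrative only and not any particular library’s API:

#include <cuda_runtime.h>
#include <cstdio>

#define QUEUE_CAPACITY 256

struct MsgQueue {
    volatile unsigned int head;   // advanced by the device (producer)
    volatile unsigned int tail;   // advanced by the host (consumer)
    int data[QUEUE_CAPACITY];     // message payload slots
};

__device__ void queuePush(MsgQueue *q, int msg)
{
    // Lamport-style single-producer push: write the payload first, then
    // publish it by advancing head (no overflow handling in this sketch).
    unsigned int h = q->head;
    q->data[h % QUEUE_CAPACITY] = msg;
    __threadfence_system();       // make the payload visible to the host
    q->head = h + 1;
}

__global__ void producerKernel(MsgQueue *q)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        queuePush(q, 42);         // "ask the host to do something"
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);

    MsgQueue *q;
    cudaHostAlloc((void **)&q, sizeof(MsgQueue), cudaHostAllocMapped);
    q->head = q->tail = 0;

    MsgQueue *dq;
    cudaHostGetDevicePointer((void **)&dq, q, 0);
    producerKernel<<<1, 32>>>(dq);
    cudaStreamQuery(0);           // flush the launch on platforms that batch launches

    // A host thread polls the queue while the kernel runs, then consumes.
    while (q->tail == q->head) { /* spin */ }
    printf("host received message %d\n", q->data[q->tail % QUEUE_CAPACITY]);
    q->tail = q->tail + 1;

    cudaDeviceSynchronize();
    cudaFreeHost(q);
    return 0;
}

A real implementation would of course need a multi-producer protocol (for example atomics on the head index, or one queue per block) and overflow handling.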

You are right that you can’t remotely invoke a host function from the device via an indirect function call. You always have to ask a host thread to do it for you.

[1] - “Message Passing on Data-Parallel Architectures” http://www.idav.ucdavis.edu/publications/print_pub?pub_id=959

This is exactly how IBM Cell works. The SPE sends an interrupt (stop and signal instruction) after preparing a message for the host.

The host reads the C API number and invokes it via a host thread…

Check “YetAnotherMonkey” post in the URL below…

http://www.ibm.com/developerworks/forums/thread.jspa?messageID=14023117&

This is an interesting proposal for an architecture that can do exactly that:

Since I started this thread, it’s only proper that I be the one to wrap it up, or at the very least, leave a (semi?) final note before I depart. Feel free to continue without me though…

The good news is that this thread has been one of the most intelligent and informative threads I’ve ever had the privilege of participating in. I would highly recommend it to anyone wishing to further their knowledge of not only CUDA programming, but parallelism in general, as the issues discussed here carry with them important implications for the future of parallelism that have far exceeded the initial subject matter.

Before I attempt to explain that rather ominous statement however, I’d like to get something off my chest before anyone reading this has a chance to get bored and move on to something else.

For the record, I’ve discovered what can only be described as deficiencies in the PTX ISA 2.1, which I humbly submit should be corrected in a future version of the ISA. They are:

  1. The bar.sync instruction should no longer issue an “arrival” signal,

  2. The condition code register should be preserved across branches,

  3. The “mad” instruction should be enhanced to include an optional “.cc” qualifier, and

  4. A “madc{.cc}” instruction should be added to the instruction set.

The reasoning for the first suggestion has been amply discussed in this thread, so there is no further need to elucidate it here, except to say that if backward compatibility is an issue, the symbolic assembler could silently emit a preceding bar.arrive instruction for previous versions.

Implementation of the second suggestion is an obvious enhancement for any large integer add or multiply, because a “uniform” branch is implicitly required to loop through the digits of any large operand. This would obviate the current requirement to save and restore the condition status in a destination register for every iteration of the loop, which is an obvious waste of valuable clock cycles.

Finally, implementation of the third and fourth suggestions would literally double the speed with which large integer multiplication could be performed on NVidia hardware.

Having gotten that off my chest, I’d like to move on to the really interesting stuff.

I recently got around to reading tera’s excellent link, “myths about GPU computing that sometimes need a bit of debunking”. That paper is chock full of information, but for me, one of the most important results of reading that paper is that it introduced me to the idea that there are two very different types of parallelism: DLP and TLP.

To make a long story short, I didn’t understand this very basic distinction until recently. Once I did, it became very obvious to me that attempting to implement the TLP part of any algorithm on the SIMD-based (a.k.a. DLP-based) hardware of a CUDA board is definitely not a good idea; at least not for any significantly large project. Which, to be fair, is basically what Sarnath said in his one-line post.

It could also be described (and yes, it pains me to say this) as an attempt at “ramming a square peg into a round hole”… Which sounds… um… vaguely familiar somehow…

With this new insight freshly ingrained into my thickheaded psyche, I took another look at my algorithm. What I discovered is that it could be said to basically alternate between a DLP part, and a TLP part. The DLP part is the Calc_Something() call, and the TLP part is the Calc_Var() call.

So the idea would be to implement the DLP part on the CUDA board, and the TLP part on the host. But despite the fact that these parts are neatly separated into different functions, their separation into different threads was an entirely different matter, because they’re basically executed sequentially (hence the “alternating” cycle of DLP/TLP). At first glance, this problem didn’t seem to have an obvious solution.

But then I read Gregory Diamos’ excellent contributions to this thread, about setting up a CUDA thread invocation queue (let’s call it a “CTI Queue”), which would be added to by a CUDA thread, and consumed by a host thread. Each queue entry would represent the future invocation of one (or more) CUDA threads. That’s when things really began to fall into place.

Not only will the system he describes work, but the host thread that polls the CTI Queue can be extended to include the entire TLP part of my algorithm. Or indeed, the entire TLP part of any algorithm containing reasonably distinct TLP and DLP parts.

Specifically, if every CTI Queue element were a structure containing { slot, n, m }, then instead of just blindly invoking CUDA threads in response to the queue’s contents, the host could wait until each CUDA thread is “ready” to be executed, by simply adding the constraint that each CUDA thread will be invoked if and only if “Ready[ n ] >= m”.

Of course, this would necessitate the host thread also having to execute the computations contained in the TLP part of the algorithm (i.e. the Do_Some_Stuff() function), but since that can very easily be delegated to another queue, another synchronous host thread, or both, that’s really not a problem.
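As a rough, self-contained illustration of that constraint check, here is a host-side sketch; the CtiEntry layout { slot, n, m }, the contents of the Ready[] array, and launchChunk() are hypothetical stand-ins, and in a real run the pending entries would arrive through the zero-copy queue described earlier:

#include <cstdio>
#include <deque>
#include <vector>

struct CtiEntry { int slot; int n; int m; };

static void launchChunk(int slot)             // stand-in for a real kernel launch
{
    std::printf("launching chunk in slot %d\n", slot);
}

int main()
{
    std::vector<int> Ready = { 2, 0, 5 };     // per-variable readiness counters
    std::deque<CtiEntry> pending = {          // entries posted by device threads
        { 0, 0, 1 },   // ready:   Ready[0] = 2 >= 1
        { 1, 1, 3 },   // waiting: Ready[1] = 0 <  3
        { 2, 2, 5 },   // ready:   Ready[2] = 5 >= 5
    };

    // One dispatcher pass: launch only the entries whose constraint holds,
    // and keep the rest queued until a later Ready[] update satisfies them.
    for (auto it = pending.begin(); it != pending.end(); ) {
        if (Ready[it->n] >= it->m) {
            launchChunk(it->slot);
            it = pending.erase(it);
        } else {
            ++it;
        }
    }
    std::printf("%zu entries still waiting\n", pending.size());
    return 0;
}

Ready[] here is just a plain array; in practice it would be updated as the host’s own TLP computations complete.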

It’s the perfect solution…

I could even see how this could be generalized into a host-based “TLP handler” which could, in addition to handling the CTI Queue, invoke each CUDA thread only if an optional host-based function returns a boolean true when proper synchronization was achieved (the requirement to synchronize being the hallmark of a TLP process). The address of that function could be looked up in a table from an index supplied in the queue entry itself.

Just thought I’d throw that last part out there - it seems like a fairly good idea, but someone else has probably thought of it before…

Then, last but not least, roddomi provides us all with a link to a paper for an entire multiprocessor architecture that “can do exactly that” - i.e. it’s specifically built to enable the partitioning of the TLP parts of a parallelized process from the DLP parts (as far as I can tell)…

Well, that really clinched it for me - that’s the future of parallel computing, right there. Absolutely fascinating. And it’s seven years old!! Where can I buy a SCALE VT board?? Is it on eBay yet??

Anyway, that’s the good news. The bad news is that tera was basically right (it seems I’m the only one in this whole thread who is consistently wrong). In the end, my project can’t be done using 8 Tesla boards and an Intel 6-core host, but I came close.

The reason for that can be explained quite easily, using what tera has called, basic “inside-the-box” thinking. Embarrassingly simple thinking actually…

First of all, the time it takes to complete one 315,653 decimal-digit multiplication for the one core of the 2.4 GHz dual-core processor can be approximated quite readily, no profiling required. It took one month to complete 8,386,560 multiplications serially, so each multiplication took 0.31 seconds - simple division.
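Spelling that division out (taking the month as roughly 30 days):

$$\frac{30 \times 86{,}400 \text{ s}}{8{,}386{,}560} \approx 0.31 \text{ s per multiplication}$$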

That’s significantly longer than 1/100 of a second - 31 times longer actually.

Second, I misrepresented the problem to both tera and myself when I implied that the problem boiled down to performing 8,386,560 multiplications “concurrently”.

That just can’t happen. Why? Because each 315,653 decimal-digit operand occupies 32,768 DWORDs (32-bit unsigned integers), so their product must occupy twice that: 65,536 DWORDs.

So 8,386,560 “concurrent” multiplications would therefore require reserving memory for 8,386,560 x 65,536 DWORDs, or 549,621,596,160 DWORDs, or… wait for it… two terabytes of storage space…
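For the record, the byte conversion behind that figure (4 bytes per DWORD):

$$549{,}621{,}596{,}160 \text{ DWORDs} \times 4 \text{ bytes} \approx 2.2 \times 10^{12} \text{ bytes} \approx 2 \text{ TB}$$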

Even divided into eight Tesla boards, that would require each board to have 256 Gigabytes of memory.

A big part of the beauty of my algorithm (and yes, I’m still proud of it), is that it gets around this huge memory limitation by performing concurrent multiplications in big “chunks”, where each “chunk” re-uses the same memory. Unfortunately, that means that each “chunk” of concurrent multiplications therefore has to be executed serially.

How I conveniently forgot that very salient fact is currently beyond my comprehension…

So assuming that each CUDA board is a Tesla with 6 Gig, then I could probably safely run 4 concurrent “chunks” per board, where each “chunk” would consist of 4,096 threads (multiplications), for a grand total memory requirement of 4 x 4,096 x 65,536 DWORDs = 4 Gigabytes per board, the remaining 2 Gigabytes left for the CUDA system internals (cache, etc.), and the Karatsuba function’s stack.

Interesting sidebar: Thanks entirely to Gregory Diamos’ ingenious CTI Queue idea, a “chunk”, which is somewhat analogous to a CUDA “stream”, is no longer limited to the maximum number of threads in a block, precisely because the synchronization has been offloaded to the host. A “chunk” could therefore be configured as a grid of 4 CUDA blocks, each with 1,024 threads, or a grid of 8 CUDA blocks, each with 512 threads, or whatever works best.

If I use 8 Tesla boards, I could therefore run 4 x 8 = 32 concurrent “chunks” at a time. However, if each “chunk” is 4,096 threads, the algorithm demands that 4,096 “chunks” be executed, so the complete calculation would take 4,096 / 32 = 128 times longer than time “T” (0.31 seconds).

So 8,386,560 multiplications would take approximately 0.31 x 128 = 39.68 seconds.

At that rate, to get a final answer, it would not have to run for just a year; it would have to run for 3,968 years.

And I don’t think I quite have the patience for that…

To be fair, I think that in actual practice the 8,386,560 multiplications would probably take closer to 20 seconds, because only the first “chunk” actually performs 4,096 multiplications - the second chunk performs 4,095, the third 4,094, the fourth 4,093, and so on. Also, I think the Tesla board could possibly perform the actual multiplication in less than 0.31 seconds, especially since all of the “barrier” code was obviated. Still, that’s a running time of 2,000 years instead of 3,968, so…

Thank you all for all your help. As Thomas Edison might have said, I’ve officially discovered another way of not making a working light bulb…

I guess the synchronization model you have in mind is something like Monotonic Counters (http://ipdps.cc.gatech.edu/2000/papers/thornley377.pdf).

The model that PTX currently implements is obviously different, but is it inferior?

i.e., can you provide a use case for which Monotonic Counters would be more efficient/expressive than the sync/arrive model?

The PTX manual gives the example of a producer/consumer scenario:

/* Producer/consumer model. The producer deposits a value in
 * shared memory, signals that it is complete but does not wait
 * using BAR.ARV, and begins fetching more data from memory.
 * Once the data returns from memory, the producer must wait
 * until the consumer signals that it has read the value from
 * the shared memory location. In the meantime, a consumer
 * thread waits until the data is stored by the producer, reads
 * it, and then signals that it is done (without waiting).
 */
   // Producer code places produced value in shared memory.
   st.shared   [r0],r1;
   bar.arrive  0,64;
   ld.global   r1,[r2];
   bar.sync    1,64;
   ...
   // Consumer code, reads value from shared memory
   bar.sync    0,64;
   ld.shared   r1,[r0];
   bar.arrive  1,64;
   ...

The side effect is that consumers are synchronized with each other in addition to being synchronized with the producers, and the other way around. Not a problem here, but I can imagine there are cases where we do not want that…

I do not like the way condition codes are implemented in PTX either, but I believe following your suggestion would only make the matter worse.

The root of the problem is that CC is an implicit register in PTX: a part of the virtual machine state. So the add.cc instruction has side effects. And side effects are bad.

Condition codes should just appear as another type of registers to the compiler and be copyable, so the compiler can perform all optimizations the usual way. Even if the actual machine only has one implicit CC register.

What would be the semantics of this new instruction?

  • If it is equivalent to mul.{lo,hi} followed by addc{.cc}, then such an instruction already exists in the actual Fermi ISA. The backend compiler should already use it whenever it sees the equivalent sequence in PTX.

  • If it generates a carry when the multiplication overflows, then there is no direct hardware implementation, but you can emulate it easily in software using mul.hi.

(I wish NVIDIA could stop calling PTX an ISA in the various documentations and papers… PTX is an intermediate language which does not have, and should not have, characteristics of a real ISA like specialized instructions or a limited set of registers.)

Billy,

I have not followed the thread closely… But if execution time is your problem, can’t you hire some 1,000 Tesla GPUs from Amazon?
Just blast it inside the cloud… if you have that parallelism…

Maybe Billy does not have the 2e10 dollars to pay for it…

Nice catch… 2000 years is still huge…

Or maybe Billy just needs to wait until NVIDIA rents out their exa-scale machine… or maybe Maxwell and Kepler could solve it.

I’m not saying that the “model that PTX currently implements” is “inferior”, I’m saying it’s broken.

The code example you provided makes my point quite readily. Just add a second consumer, and watch what happens:

// Producer code places produced value in shared memory.
   ...                  // <== some lengthy process..
   st.shared   [r0],r1;
   bar.arrive  0,64;    // <== happens at time (T+2)..
   ld.global   r1,[r2];
   bar.sync    1,64;
   ...
   // Consumer (1) code, reads value from shared memory
   bar.sync    0,64;    // <== happens at time (T) - issues arrival at barrier 0 !!
   ld.shared   r1,[r0];
   bar.arrive  1,64;
   ...
   // Consumer (2) code, reads value from shared memory
   bar.sync    0,64;
   ld.shared   r1,[r0]; // <== happens at time (T+1) >> OOPS - NO DATA !! <<
   bar.arrive  1,64;
   ...

On the other hand, if the bar.sync instruction doesn’t issue an arrival signal, then the above would work just fine as written, because Consumer (2)'s bar.sync instruction would wait until time T+2, when the barrier had actually been arrived at.

In fact, the way it works now, even if both Consumer (1) and Consumer (2) were the same kernel executing in different warps, it would still screw up. And this is in the PTX manual!?!

In short, and as I’ve said before, it just doesn’t make logical sense that the bar.sync instruction should issue an arrival signal, because it’s not “arriving” at anything - it’s waiting on an arrival. Why can’t anyone (except possibly tera) understand that?

Waiting for a flight at the airport (bar.sync) is not equivalent to landing the plane (bar.arrive).

Sort of, but not really.

That paper is obviously specific to multiprocessor architectures that assume the ability of a thread to independently suspend its own execution indefinitely, until such time as another thread decides to “resume” it. From the paper itself:

“c.Check(level) atomically compares c.value to level and suspends until c.value >= level.”

“c.Increment(amount) atomically increments c.value by amount, thereby reawakening all c.Check operations suspended on values less than or equal to the new c.value.”

That strongly implies its suitability for pre-emptive MPMD architectures (such as Win32), not SIMD or SPMD architectures. With the latter, at least in CUDA, there is no mechanism for a thread to independently suspend its own execution indefinitely, because resumption of a suspended thread is always tied to a pre-defined hardware event of some kind, such as a barrier, not a data-dependent software decision (such as Win32’s ResumeThread() function). In fact, the whole thing seems to me to be little more than a software optimization of existing MPMD synchronization methods.

So that particular paper doesn’t really address the implementation of what it calls “Monotonic Counters” to a more restrictive SPMD system, such as CUDA.

On the other hand, my implementation of the Ready state variable array, which employs the PTX barrier mechanism to achieve thread suspension, does.

So to avoid confusion, I think I’ll call my implementation the “Barrier Arrival Ready Requirement Accumulator”, or “Barra” for short. I’m thinking of copyrighting the term. What do you think, Sylvain?

You’re right. It would be far better if the “condition code”, which would include, but not necessarily be limited to, the carry status, were an accessible register. I was just trying to limit my suggestions to whatever I thought could most easily be implemented by the NVidia hardware people.

However, in trying to work up an example piece of code to illustrate my second suggestion, I realized that in most cases, saving and restoring the carry status has to be done in every iteration of the loop anyway, because indexing each operand’s digits will almost always require the use of the add instruction.

So I officially rescind my second suggestion, in favor of our mutually agreeable recommendation that the condition code should be an accessible register. At least that way, it could be saved and restored more optimally. In the x86 world, you can use either pushf/popf or lahf/sahf, so it’s really easy and simple.

As for my third and fourth suggestions, they would work as equivalent to mul.wide, followed by addc.cc.

But my comments that both of these suggestions would “double the speed with which large integer multiplication could be performed” are still applicable, as they are based on a rather important footnote that I found in tera’s “debunking” paper link, which states:

“The peak single-precision SIMD Flops for GTX280 is 311.1 and increases to 933.1 by including fused multiply-add”.

From that, I figured it was safe to infer that using “fused multiply-add” (mad) is at least twice as fast as not using it. So yes, you can use “mul.{lo,hi} followed by addc{.cc}” as you suggested, but it would be better, for strictly performance reasons, to be able to use the single mad instruction instead.

I think the cost-benefit ratio of doing so would be very high. Adding that capability would be a relatively small improvement to the mad instruction, but it would at least double the speed of an entire class of algorithms. Without that improvement, large integer multiplies simply cannot use it, because the carry state is lost.
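For reference, here is a sketch of the carry-chained inner step a multiword multiply has to use today, using only instructions that already exist in PTX 2.x (register names are arbitrary); a carry-aware mad would collapse each mul/add pair into a single instruction:

   // accumulate the 32x32 -> 64-bit partial product a*b (in r1, r2) into the
   // running sum held in r6 (low limb), r7 (next limb), r8 (carry limb)
   mul.lo.u32   r4, r1, r2;     // low 32 bits of a*b
   mul.hi.u32   r5, r1, r2;     // high 32 bits of a*b
   add.cc.u32   r6, r6, r4;     // add low part, set carry
   addc.cc.u32  r7, r7, r5;     // add high part plus carry-in, set carry
   addc.u32     r8, r8, 0;      // propagate the carry into the next limb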

Maybe I’m missing something, but wouldn’t this work?

mov.u32 r3, %ntid.x;    // get the total number of threads

// Producer code places produced value in shared memory.
   ...                  // <== some lengthy process..
   st.shared   [r0],r1;
   bar.arrive  0,r3;
   ld.global   r1,[r2];
   bar.sync    1,r3;
   ...
   // Consumer (1) code, reads value from shared memory
   bar.sync    0,r3;
   ld.shared   r1,[r0];
   bar.arrive  1,r3;
   ...
   // Consumer (2) code, reads value from shared memory
   bar.sync    0,r3;
   ld.shared   r1,[r0];
   bar.arrive  1,r3;
   ...

Yes, that specific case would work, because each consumer would wait until all threads had arrived at the barrier, including the Producer thread.

BTW, you could have just used “bar.sync 0”, but you probably already know that…

However, should every consumer have to wait for every other consumer to be ready to use the data, before it can use the data itself? What if there are 1,000 consumers, and all each wants to do with the data is check its value against some limit, and proceed accordingly? They would all have to wait until all other consumers are ready to check the value, before even one of them could proceed.

That seems like a horrible waste of available processor time. Perhaps not within a single warp, because they’re all supposed to be executing in lockstep anyway, but certainly within the threads of an entire block. Wouldn’t it be far better if each consumer could just wait on the one Producer thread (or warp), and proceed when the data has been written?

Your example also indirectly illustrates the questionable usefulness of the bar.sync’s new thread count parameter, because using anything less than the total number of threads in the block would definitely not work. Why? Because the Producer thread’s bar.arrive could still be the last arrival signal to occur.

And it’s that last point that leads me to believe that the bar.sync’s generation of an arrival signal is what is commonly referred to as “legacy code”. It’s unnecessary, needlessly compromises performance, and renders the bar.sync instruction’s new thread count parameter almost completely useless.

Or, looking at it from the “glass half full” side of things, removing the arrival signal from the operation of the bar.sync instruction would make the whole barrier mechanism far more useful, flexible, and understandable.

Thanks for pointing this out. My previous comments obviously didn’t make allowances for this special case.

OK, I sort-of see what you mean. I still won’t say that CUDA’s model is “broken”, but it looks indeed needlessly complicated because of legacy issues. Unless we get some real examples and/or rationale.

Okay, but how would you implement the PTX barrier mechanism in hardware, except by using some sort of monotonic counters?

Hey, wait, you can’t do that! :O

Yes, but only the addc instruction updates the CC register. You can still mix the addc.cc sequence with regular adds, so your idea does work.

But it is not a hardware issue. PTX is not the instruction set.

Actually, if you take a look at the Tesla ISA (pre-Fermi) using decuda or cuobjdump, you will see that it is just like what you propose. It has several condition code registers, and each of them contains the usual Zero, Negative, Overflow and Carry flags. Every arithmetic instruction (including FP) can write to one CC register. You can copy the CC registers between each other and to/from general-purpose registers, and you can predicate instructions (including branches) with a CC register and a condition (e.g., Negative or Zero).

Fermi has dropped some of these features, probably for valid reasons. But the role of PTX is to abstract the details of both architectures.

It makes no sense trying to infer performance issues from PTX instructions, as Avidday told you earlier in this thread. PTX is a high-level language! It is fed to the ptxas compiler, which performs lots of optimizations on it, including replacing sequences of PTX instructions by a single assembly instruction, or one PTX instruction by a sequence of assembly instructions, as well as instruction reordering and copy-propagation.

The madc.cc instruction you describe does exist in both Tesla and Fermi ISAs. If sequences of mul+addc.cc in PTX are not translated to madc.cc instructions, it should likely be considered a compiler bug.