Dynamic Branching in CUDA

Hello all,

Is it possible to use dynamic branching in device code? For example, is it possible to jump to an address computed dynamically? And, more importantly for me, is it possible to write some data to an address (on the device) and then transfer control to that address?

If I understand the documentation correctly, I can’t do this directly in PTX since ptxas only accepts labels as branch and call targets, and indirect jumps through registers are unimplemented. I still think this would be possible by tweaking the cubin file.
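
To illustrate (a hand-written PTX fragment, not taken from any real kernel): every branch target ptxas will accept is a compile-time label, e.g.

    .reg .pred %p1;
    .reg .s32  %r1, %r2;

        setp.eq.s32  %p1, %r1, 0;    // set a predicate
    @%p1 bra         DONE;           // branch target is a label: accepted
        add.s32      %r2, %r2, 1;
    DONE:
        ret;

whereas something like "bra %r2;" (branching through a register holding a computed address) is what I would actually need, and ptxas rejects it.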

Any ideas / experience on this kind of stuff ?

Thanks,
Dan

I don’t think you can write to the memory area where the program is stored.

And looking at the decuda output, it may well be that the hardware can only jump to labels, though information about that kind of low-level stuff seems quite rare.

Dynamic branching will probably arrive when function calls, recursion, etc. are implemented. A few people use function pointers… CUDA needs to support those.
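
In the meantime, the usual workaround I've seen (just a sketch with made-up names, not anything official) is to dispatch on an integer tag with a switch, so every target is still a compile-time label:

    // Hypothetical example: emulating a "function pointer" with an integer
    // tag plus a switch, since device-side indirect calls/jumps aren't
    // available. opAdd, opMul and dispatch are made-up names.
    __device__ float opAdd(float a, float b) { return a + b; }
    __device__ float opMul(float a, float b) { return a * b; }

    __global__ void dispatch(const int *op, const float *x, const float *y,
                             float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        switch (op[i]) {                                // per-thread choice...
        case 0:  out[i] = opAdd(x[i], y[i]); break;     // ...but each target
        case 1:  out[i] = opMul(x[i], y[i]); break;     // is a fixed label
        default: out[i] = 0.0f;              break;
        }
    }

Threads of one warp that pick different cases will serialize, of course, which is presumably part of why real indirect jumps are hard to do efficiently.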

As for self-modifying code… you’re really asking for a lot.

Question for the CUDA team: is there an ETA for the indirect jump (or dynamic function call, or jump table) implementation?

It can’t come before a major revision of the hardware, so at least one year from now (and if not then, then three years). Unfortunately, NVIDIA never gives roadmaps that far ahead. (But maybe they can comment on whether they’re trying to enable it at all.)

I wonder if this is something that was not included simply because DX10 doesn’t ask for it, or if there are major difficulties (e.g. regarding warp divergence) in implementing it.

A slide from NVISION called “GPU 2013” has a timeline with the following items (suggesting the first item will arrive first):

C++ (another slide also mentioned Fortran together with complete C++ support)

Preemption

Complete pointer support

Virtual pipeline

Adaptive workload partitioning

There were also some NVIDIA roadmap slides for CUDA & OpenCL to be found online:

2.1 : Q4 2008

2.2 : Q1 2009

2.3 : mid 2009

3.0 : Q4 2009

So I would not expect complete pointer support before the beginning of 2010, and to be honest probably even later; it depends on whether 2.3 already gets preemption.

[EDIT typo in the year of 3.0 ;)]

I found the slide but it didn’t explain anything further. I am guessing:

Preemption must be multitasking for kernels. (Multiple kernels running, time-sliced, like on a CPU.)
Complete pointer support must be function pointers, recursion, etc.
Adaptive workload partitioning is multiple kernels running simultaneously, and each gets an MP for itself

But what is “virtual pipeline”?

I also wondered about that… Could it be some kind of (virtual) inter-MP communication link that would allow you to break up a calculation into a sequence of kernels running on different multiprocessors? As results finished in one block in kernel A, they would be passed to a block in kernel B. Perhaps this is an organizational technique to allow one to make use of hundreds of MPs effectively, even when your dataset is not wide enough to span that many simultaneous, identical blocks.

Does that even make sense? (I’m free-associating here, hoping this triggers a more plausible idea in someone else.)

ooo that’s a good idea. That’d be copying how the Cell often works.

(Whether this is something people would have a use for is a whole other question. It does pretty much what a stream is meant to do, if you include the “GPU partitioning” feature. The pipeline just seems like more work and thinking, and you’d have to manually scale it as the MP count of the GPU changes. I think it’s a Cell-ism that’s more suited to that architecture.)

Edit: But maybe what it really is is just an inter-MP synchronization primitive (that could be used for a pipeline). That’s interesting, but again, probably more a headache than a solution. In the end, it’d only give you a small efficiency boost over streams (considering streams have to be managed from the host side). It would be better to just fold it into the concept of streams (i.e., add a loop construct to the stream paradigm, an automatic while(){…some stream actions…}, i.e. a queueInStream(a-conditional-backward-jump)).
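
To make the comparison concrete, here is roughly what the host-managed version of such a loop looks like today (sketch only; “step”, d_state and d_done are made-up names, and the kernel body is omitted):

    #include <cuda_runtime.h>

    // Hypothetical kernel: advances the state and sets *done when finished.
    __global__ void step(float *state, int *done);

    void runLoop(float *d_state, int *d_done)
    {
        cudaStream_t s;
        cudaStreamCreate(&s);

        int h_done = 0;
        while (!h_done) {
            step<<<128, 256, 0, s>>>(d_state, d_done);     // enqueue one pass
            cudaMemcpyAsync(&h_done, d_done, sizeof(int),
                            cudaMemcpyDeviceToHost, s);    // copy the flag back
            cudaStreamSynchronize(s);                      // host has to wait
        }                                                  // before deciding
        cudaStreamDestroy(s);
    }

An in-stream conditional backward jump would eliminate those synchronize-and-copy round trips to the host.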

http://www.techpowerup.com/img/08-12-27/42a.gif suggests that CUDA 3.0 will be released with the upcoming GT300 architecture. So it looks like by this time next year we will again have a new compute capability.

Looks like we may have something new with the GT212 in 6 months.

But surprisingly we won’t have laptop G200s for 9 months, although at that point there’ll also be a motherboard-integrated G200, which is cool.

And GT212 should be 40 nm already…

A bit disappointed about the laptop chips, but then again, I shouldn’t need one until the end of 2009, though it would be nice to test the performance before the demo ;)