Dynamic Branching in CUDA

Hello all,

Is it possible to use dynamic branching in device code? For example, is it possible to jump to an address computed dynamically? And, more importantly for me, is it possible to write some data to an address (on the device) and then transfer control to that address?

If I understand the documentation correctly, I can’t do this directly in PTX since ptxas only accepts labels as branch and call targets, and indirect jumps through registers are unimplemented. I still think this would be possible by tweaking the cubin file.
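
To illustrate (a hand-written PTX fragment, not taken from any real kernel): every branch target ptxas will accept is a compile-time label, e.g.

    .reg .pred %p1;
    .reg .s32  %r1, %r2;

        setp.eq.s32  %p1, %r1, 0;    // set a predicate
    @%p1 bra         DONE;           // branch target is a label: accepted
        add.s32      %r2, %r2, 1;
    DONE:
        ret;

whereas something like "bra %r2;" (branching through a register holding a computed address) is what I would actually need, and ptxas rejects it.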

Any ideas / experience on this kind of stuff ?

Thanks,
Dan

I don’t think you can write to the memory area where the program is stored.

And looking at the decuda output, it may well be that the hardware can only jump to labels, though information about that kind of low-level stuff seems quite rare.

Dynamic branching will probably arrive when function calls, recursion, etc. are implemented. A few people use function pointers… CUDA needs to support those.
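
In the meantime, the usual workaround I've seen (just a sketch with made-up names, not anything official) is to dispatch on an integer tag with a switch, so every target is still a compile-time label:

    // Hypothetical example: emulating a "function pointer" with an integer
    // tag plus a switch, since device-side indirect calls/jumps aren't
    // available. opAdd, opMul and dispatch are made-up names.
    __device__ float opAdd(float a, float b) { return a + b; }
    __device__ float opMul(float a, float b) { return a * b; }

    __global__ void dispatch(const int *op, const float *x, const float *y,
                             float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        switch (op[i]) {                                // per-thread choice...
        case 0:  out[i] = opAdd(x[i], y[i]); break;     // ...but each target
        case 1:  out[i] = opMul(x[i], y[i]); break;     // is a fixed label
        default: out[i] = 0.0f;              break;
        }
    }

Threads of one warp that pick different cases will serialize, of course, which is presumably part of why real indirect jumps are hard to do efficiently.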

As for self-modifying code… you’re really asking for a lot.

Question for the CUDA team: is there an ETA for the indirect jump (or dynamic function call, or jump table) implementation?

It can’t come before a major revision of the hardware, so at least one year from now (and if not then, then three years). Unfortunately, NVIDIA never gives roadmaps that far ahead. (But maybe they can comment on whether they’re trying to enable it at all.)

I wonder if this is something that was not included simply because DX10 doesn’t ask for it, or if there are major difficulties (e.g. regarding warp divergence) in implementing it.

A slide from NVISION called “GPU 2013” has a timeline with the following items (suggesting the first item will arrive first):

C++ (another slide also mentioned Fortran together with complete C++ support)

Preemption

Complete pointer support

Virtual pipeline

Adaptive workload partitioning

There were also some NVIDIA roadmap slides for CUDA & OpenCL to be found online:

2.1 : Q4 2008

2.2 : Q1 2009

2.3 : mid 2009

3.0 : Q4 2009

So I would not expect complete pointer support before the beginning of 2010, and to be honest probably even later; it depends on whether 2.3 already gets preemption.

[EDIT typo in the year of 3.0 ;)]

I found the slide but it didn’t explain anything further. I am guessing:

Preemption must be multitasking for kernels. (Multiple kernels running, time-sliced, like on a CPU.)
Complete pointer support must be function pointers, recursion, etc.
Adaptive workload partitioning is multiple kernels running simultaneously, and each gets an MP for itself

But what is “virtual pipeline”?

I also wondered about that… Could it be some kind of (virtual) inter-MP communication link that would allow you to break up a calculation into a sequence of kernels running on different multiprocessors? As results finished in one block in kernel A, they would be passed to a block in kernel B. Perhaps this is an organizational technique to allow one to make use of hundreds of MPs effectively, even when your dataset is not wide enough to span that many simultaneous, identical blocks.

Does that even make sense? (I’m free-associating here, hoping this triggers a more plausible idea in someone else.)

ooo that’s a good idea. That’d be copying how the Cell often works.

(Whether this is something people would have a use for is a whole other question. It does pretty much what a stream is meant to do, if you include the “GPU partitioning” feature. The pipeline just seems like more work and thinking, and you’d have to manually scale it as the MP count of the GPU changes. I think it’s a Cell-ism that’s more suited to that architecture.)

Edit: But maybe what it really is is just an inter-MP synchronization primitive (that could be used for a pipeline). That’s interesting, but again, probably more a headache than a solution. In the end, it’d only give you a small efficiency boost over streams (considering streams have to be managed from the host side). It would be better to just fold it into the concept of streams (i.e., add a loop construct to the stream paradigm, an automatic while(){…some stream actions…}, i.e. a queueInStream(a-conditional-backward-jump)).
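
To make the comparison concrete, here is roughly what the host-managed version of such a loop looks like today (sketch only; “step”, d_state and d_done are made-up names, and the kernel body is omitted):

    #include <cuda_runtime.h>

    // Hypothetical kernel: advances the state and sets *done when finished.
    __global__ void step(float *state, int *done);

    void runLoop(float *d_state, int *d_done)
    {
        cudaStream_t s;
        cudaStreamCreate(&s);

        int h_done = 0;
        while (!h_done) {
            step<<<128, 256, 0, s>>>(d_state, d_done);     // enqueue one pass
            cudaMemcpyAsync(&h_done, d_done, sizeof(int),
                            cudaMemcpyDeviceToHost, s);    // copy the flag back
            cudaStreamSynchronize(s);                      // host has to wait
        }                                                  // before deciding
        cudaStreamDestroy(s);
    }

An in-stream conditional backward jump would eliminate those synchronize-and-copy round trips to the host.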

http://www.techpowerup.com/img/08-12-27/42a.gif suggests that CUDA 3.0 will be released with the upcoming GT300 architecture. So it looks like by this time next year we will again have a new compute capability.

Looks like we may have something new with the GT212 in 6 months.

But surprisingly we won’t have laptop G200s for 9 months, although at that point there’ll also be a motherboard-integrated G200, which is cool.

And GT212 should be 40 nm already…

A bit disappointed about the laptop chips, but then again, I shouldn’t need one until the end of 2009, though it would be nice to test the performance before the demo ;)