Function Calls in CUDA 3.1 Beta

So I have some questions about the low-level implementation of function calls in CUDA 3.1 beta. Before I go into details, though, I’m wondering whether it is okay to discuss the details of the generated code from the beta toolkit on this forum, or if I should be PMing someone instead.

If no one knows whether it is okay to discuss details about the beta on here, I am just going to ask now and beg forgiveness afterwards if it turns out otherwise.

My question is related to the ABI for function calls.

I think it’s probably A-OK! However, I don’t know anything about the ABI and can’t help.

Not a problem; even if you can’t answer directly, I think it still might help to see what other people think.

So the short version of my question is: are there any restrictions on divergent virtual function calls?

The long version is as follows:

So it looks like CUDA 3.1 finally supports recursion!

See the following example of a loop implemented with recursion:

extern "C" {

__device__ unsigned int counter;

__device__ unsigned int loop( unsigned int input )

{

	if( input != 0 )

	{

		++counter;

		return loop( input - 1 );

	}

	return 0;

}

}

The extern “C” is only there to prevent name mangling, which makes the following PTX easier to read.

.global .u32 counter;

	.visible .func (.param .u32 __cudaretf_loop) loop (.param .u32 __cudaparmf1_loop)
	{
	.reg .u32 %r<8>;
	.reg .pred %p<3>;
	.loc	16	8	0
$LBB1_loop:
	ld.param.u32 	%r1, [__cudaparmf1_loop];
	mov.s32 	%r2, %r1;
$Lt_0_1026:
 //<loop> Loop body line 8
	mov.u32 	%r3, 0;
	setp.eq.u32 	%p1, %r2, %r3;
	@%p1 bra 	$Lt_0_1282;
 //<loop> Part of loop body line 8, head labeled $Lt_0_1026
	.loc	16	11	0
	ld.global.u32 	%r4, [counter];
	add.u32 	%r5, %r4, 1;
	st.global.u32 	[counter], %r5;
	sub.u32 	%r2, %r2, 1;
	bra.uni 	$Lt_0_1026;
$Lt_0_1282:
	.loc	16	14	0
	mov.u32 	%r6, 0;
	st.param.u32 	[__cudaretf_loop], %r6;
	ret;
$LDWend_loop:
	} // loop

Now note the following interesting points:

  1. This example is eligible for tail-call optimization; the compiler detects this and converts it into a loop rather than a series of calls.

  2. Stores to parameter memory are used to return results from function calls.

The next example shows that the compiler can handle real recursion without faking it with a loop:

__device__ unsigned int stack( unsigned int n )
{
	if( n == 0 ) return 0;
	return stack( n - 1 ) + n;
}

	.visible .func (.param .u32 __cudaretf_stack) stack (.param .u32 __cudaparmf1_stack)
	{
	.reg .u32 %r<10>;
	.reg .pred %p<3>;
	.param .u32 __cudaparma1_stack;
	.param .u32 __cudareta_stack;
	.loc	16	18	0
$LBB1_stack:
	ld.param.u32 	%r1, [__cudaparmf1_stack];
	mov.s32 	%r2, %r1;
	mov.u32 	%r3, 0;
	setp.ne.u32 	%p1, %r2, %r3;
	@%p1 bra 	$Lt_1_1026;
	.loc	16	19	0
	mov.u32 	%r4, 0;
	bra.uni 	$LBB5_stack;
$Lt_1_1026:
	.loc	16	20	0
	sub.u32 	%r5, %r2, 1;
	st.param.u32 	[__cudaparma1_stack], %r5;
	call.uni (__cudareta_stack), stack, (__cudaparma1_stack);
	ld.param.u32 	%r6, [__cudareta_stack];
	mov.s32 	%r7, %r6;
	add.u32 	%r4, %r7, %r2;
$LBB5_stack:
	mov.s32 	%r8, %r4;
	st.param.u32 	[__cudaretf_stack], %r8;
	ret;
$LDWend_stack:
	} // stack

Note the explicit call to “stack” in the body.

Okay, so that is relatively straightforward; I assume that param memory is actually mapped to global memory and used for the stack. Now for the fun part.

typedef unsigned int (*Function)( unsigned int );

__device__ unsigned int divergentCall( )
{
	Function f = 0;
	if( threadIdx.x == 0 )
	{
		f = stack;
	}
	else
	{
		f = loop;
	}

	return (*f)( threadIdx.x );
}

This code conditionally sets a function pointer depending on the thread ID. The generated code assigns the function pointers to registers, so when the call is actually made, it goes through the value stored in %r6, which will vary across threads in the same warp.

.visible .func (.param .u32 __cudaretf__Z13divergentCallv) _Z13divergentCallv ()
	{
	.reg .u32 %r<11>;
	.reg .pred %p<3>;
	.param .u32 __cudaparma1_fproto_Ui_Ui;
	.param .u32 __cudareta_fproto_Ui_Ui;
	.loc	16	23	0
$LDWbegin__Z13divergentCallv:
	.loc	16	34	0
	mov.u32 	%r1, %tid.x;
	mov.s32 	%r2, %r1;
	st.param.u32 	[__cudaparma1_fproto_Ui_Ui], %r2;
	mov.u32 	%r3, _Z5stackj;
	mov.u32 	%r4, _Z4loopj;
	mov.u32 	%r5, 0;
	setp.eq.u32 	%p1, %r1, %r5;
	selp.u32 	%r6, %r3, %r4, %p1;
fproto_Ui_Ui: .callprototype (.param .u32 _) _ (.param .u32 _);
	call (__cudareta_fproto_Ui_Ui), %r6, (__cudaparma1_fproto_Ui_Ui), fproto_Ui_Ui;
	ld.param.u32 	%r7, [__cudareta_fproto_Ui_Ui];
	mov.s32 	%r8, %r7;
	mov.s32 	%r9, %r8;
	st.param.u32 	[__cudaretf__Z13divergentCallv], %r9;
	ret;
$LDWend__Z13divergentCallv:
	} // _Z13divergentCallv

This is a divergent function call: some threads call into one function while the others call into another. Are there any limits on the number of distinct call targets supported by this mechanism? Supporting this with the classical post-dominator warp reconvergence scheme is fine as long as there are only two possible targets of a divergent branch (or call, in this case). It seems like it would require new hardware support if there are multiple targets (you would have to create contexts for up to 32 different functions in a single instruction). This could also be handled in ptxas by detecting divergence and then serializing calls, but there would be a pretty stiff performance penalty.
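To make the multi-target case concrete, here is a hypothetical fragment (f0 through f3 and multiTargetCall are made-up names) in which four distinct targets can be live in a single warp; it needs -arch=sm_20 to compile:

typedef unsigned int (*Function)( unsigned int );

__device__ unsigned int f0( unsigned int x ) { return x;     }
__device__ unsigned int f1( unsigned int x ) { return x + 1; }
__device__ unsigned int f2( unsigned int x ) { return x + 2; }
__device__ unsigned int f3( unsigned int x ) { return x + 3; }

__device__ unsigned int multiTargetCall( )
{
	// Up to four distinct targets at one call site within a warp.
	Function table[4] = { f0, f1, f2, f3 };
	return table[threadIdx.x & 3]( threadIdx.x );
}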

So I have two questions:

  1. Will it work if all threads in the same warp call different function pointers?

  2. If so, is there hardware support or will there be a performance penalty beyond the cost of simply serializing execution of each function?

If I had to make a guess, I would say that it is implemented the same way as conflicts in constant cache and shared memory are handled.

When a call instruction is encountered:

  1. Push the return address and the current execution mask into the stack.

  2. Look at target addresses of all enabled threads.

If they are all the same, jump to that address.

If they are not:

  • Arbitrarily select one target address T.

  • Build a mask M, having a bit set for each thread which target address is T.

  • Push token (T, M) into the stack.

  • Disable threads in M from the current mask.

  • Replay the call instruction with updated mask and a “please don’t push the return address into the stack again” tag, so go back to 2.

(so the call instruction is recursive itself :) )

So with 4 threads, and a call to 3 different target addresses a, b, c:

call (a,b,a,c)

r: next instruction

may push the following into the stack (top to bottom):

(b, 0100)

(a, 1010)

(r, 1111)

then jump to c with mask 0001.

The return instruction from function c will then jump directly to function b, then from b to a, and finally back to r.
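For what it’s worth, here is a rough host-side simulation of that guessed scheme (simulateDivergentCall is a made-up name; this is speculation, not documented hardware behavior):

#include <cstdio>

// targets[i] is the call target of lane i; 'active' is the warp's
// current execution mask; lane 0 is bit 0.
void simulateDivergentCall( const unsigned int targets[32],
                            unsigned int active, unsigned int returnAddress )
{
	if( active == 0 ) return;

	// 1. Push the return address and the current execution mask.
	std::printf( "push (0x%x, %08x)\n", returnAddress, active );

	for( ;; )
	{
		// 2. Look at the targets of all enabled threads: take the first
		//    enabled lane's target and collect everyone who agrees.
		int first = 0;
		while( !( ( active >> first ) & 1u ) ) ++first;
		unsigned int t = targets[first];
		unsigned int m = 0;
		for( int i = 0; i < 32; ++i )
			if( ( ( active >> i ) & 1u ) && targets[i] == t )
				m |= 1u << i;

		if( m == active )
		{
			// All enabled threads agree: jump there.
			std::printf( "jump to 0x%x with mask %08x\n", t, active );
			return;
		}

		// Otherwise push a token (T, M), disable those threads, and
		// replay the call instruction with the updated mask.
		std::printf( "push (0x%x, %08x)\n", t, m );
		active &= ~m;
	}
}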

Just guessing, of course. ;)

Sorry, am I right that function calls use global memory (and therefore the L1/L2 caches) to store parameters?

It’s not documented, but it seems most likely… performance would be fine in all common cases (because of the cache) while capacity would not be limited (because stack data can be dumped to device memory if needed).

But what about stack size? If it is big, 15,000 active threads would consume a lot of space.

Yes… a huge amount in some worst-case scenario with deep recursion and lots of arguments. Exactly what the limits are is undocumented, though they might be pretty modest since deep call stacks are rare in CUDA.
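(For a rough sense of scale: at a hypothetical 1 KB of stack per thread, 15,360 threads would need 15,360 × 1 KB = 15 MB of device memory.)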

I would guess (just from thinking, not from experiments) that the simplest design would use local memory to give each thread a fixed-size stack space, and once you use it up, bad things happen.

You could run an experiment with a 15K-thread kernel invoking an N-deep recursive call, increasing N to see how high it can go before failure. Then repeat with a single thread to see if the same N is still the limit (in which case a fixed per-thread buffer size is likely).
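Something like the following sketch would do for the probe (probe and probeKernel are made-up names; the +1 accumulation keeps the call from being a tail call, so the compiler cannot turn it back into a loop):

// Recurse n levels deep and write the depth back, so the call chain
// cannot be optimized away. Requires compute capability 2.0 (-arch=sm_20).
__device__ unsigned int probe( unsigned int n )
{
	if( n == 0 ) return 0;
	return probe( n - 1 ) + 1; // non-tail call: forces a real stack frame
}

__global__ void probeKernel( unsigned int n, unsigned int* out )
{
	out[threadIdx.x + blockIdx.x * blockDim.x] = probe( n );
}

Launch it once with ~15K threads and once with a single thread, raising n until the launch fails.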

There’s an API call that lets you set the kernel’s stack size from the host app… it’s cudaThreadSetLimit(cudaLimitStackSize,…) for the Runtime API and cuCtxSetLimit(CU_LIMIT_STACK_SIZE,…) for the driver API. Note that in CUDA 3.1, if you’re going to change the stack size, you have to do it before launching any kernels. Also note that recursion support requires compute capability 2.0.
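A minimal usage sketch of the runtime call (the 4 KB value is an arbitrary example, not a recommendation):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
	// Must happen before the first kernel launch in CUDA 3.1.
	size_t stackSize = 4096;
	cudaThreadSetLimit( cudaLimitStackSize, stackSize );

	// Read back the value the driver actually granted.
	cudaThreadGetLimit( &stackSize, cudaLimitStackSize );
	std::printf( "per-thread stack size: %zu bytes\n", stackSize );
	return 0;
}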

–Cliff

Any performance tests or experience?

I’ll comment on this in more detail in about a month after the implementation in Ocelot is relatively stable and I can run some microbenchmarks.

As an aside, I am curious about the choice to reuse the parameter memory space for storing function arguments. Before this, the parameter memory space was used only for kernel arguments, which are invariant across all threads. Function arguments, on the other hand, can have different values for different threads, requiring separate storage for all active threads. In Ocelot this is annoying to deal with, because actually placing all parameters in the same memory space requires either duplicating kernel arguments for all active threads or splitting the address space into an invariant space (for kernel arguments) and a variant space (for function arguments). It would have been less ambiguous to either introduce a new memory space or reuse the local memory space for function arguments, which is already allocated on a per-thread basis.

Thanks for the possible explanation, I would agree that it could definitely be implemented this way. In the fully divergent case I can’t think of a way to generate the predicate stack in a single cycle.

Step 2 (comparing the target addresses of all enabled threads) seems like a very heavyweight operation compared to a taken/not-taken branch: the comparison is among up to 32 function pointers rather than just checking whether all of the bits in two 32-bit predicate masks are the same. I suppose that the same check is being done for memory accesses, and that the common case could be handled by reusing the ALUs in a single pass through the pipeline.

Someone should really dig into this with micro-benchmarks.

I’m really interested in what benchmarks will show.

I have a preliminary version of this finished for Ocelot ( source here in case anyone wants to try it out http://code.google.com/p/gpuocelot/source/…/ocelot-ptx-2.1 ).

Sylvain, I am curious if you have been able to get decuda ported to sm_21 binaries? I would like to see how function arguments and return values are handled in the stack.

Looks nice!

Unfortunately, I didn’t do anything about sm_2x…

The closest we have currently is [topic=“172577”]this[/topic]. nvc0dis should work, unless SM 2.1 happens not to be binary compatible with 2.0… But it does not seem to support indirect calls (they are not that common in graphics shaders…)

I would assume that stack management works like most ABIs on RISC CPUs: by using one or two registers as frame and/or stack pointers, several registers for fast parameter passing, a callee-save or caller-save convention for the other registers… Then the stack would just be an area in local memory indexed by the frame pointer, where registers are spilled.

Lots of things to try… Too bad I am busy writing my dissertation.

Thanks, I’ll give it a try.

The main thing that I am wondering about is whether the stack frame is allocated as one big frame for all threads or managed on a per-thread basis. In Ocelot, I implemented it as one big frame with space for all threads, with the expectation that this would simplify pushing/popping frames (one SP update per unique call/return). But going back over it, it seems like this could greatly increase the total space required in highly divergent cases.
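(For example, if only one thread recurses N levels deep, a unified frame would still reserve argument space for every thread at each of the N levels, assuming I have the trade-off right.)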

I’m also curious to see how the spill area is handled in cases with large numbers of function arguments.

It is a huge time sink, isn’t it? I need to start working on mine again at some point… What is your topic, btw?

The stack is stored in local memory, so address translation is up to the local-to-physical address mapping mechanism. Local memory is allocated statically for all threads (adjustable using cudaThreadSetLimit).

Given the way branch divergence is (or may be?) implemented, the stack pointers of all active threads in a warp should stay synchronized. (Inactive threads are just masked off and can be ignored, and they will reconverge at the same stack depth at which they diverged.)

So even if the stack may seem to be managed on a per-thread basis, it is more per-warp as a consequence of SIMT execution.

It is probably a good thing:

The L1 cache line size (128B) matches a warp-sized vector of 32-bit values. So all threads in a warp should always access the same cache line when spilling or restoring one (32-bit) local variable.
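A sketch of the interleaved addressing this implies (slotAddress is a made-up helper; the real local-to-physical mapping is undocumented):

// Hypothetical slot-major, lane-minor stack layout. A 32-bit spill by a
// full warp then covers exactly one 128-byte L1 line: 32 lanes * 4 B = 128 B.
__device__ unsigned int* slotAddress( unsigned int* warpStackBase,
                                      unsigned int slot )
{
	unsigned int lane = threadIdx.x & 31; // lane index within the warp
	return warpStackBase + slot * 32 + lane;
}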

Compacting the stack to make it smaller sounds nice.

If we had “vote.count” and “vote.scan” instructions to respectively count the number of active threads in the warp and compute a thread ID among only the active threads in the warp, we could dynamically adjust the “width” of each stack frame.

In sloppy pseudo-PTX:

function f:
  // Compute stack width for current frame
  vote.count width
  vote.scan tid
  // Push some variables
  st.global.b8 [sp - width + tid], a
  st.global.b8 [sp - 2 * width + tid], b
  st.global.b8 [sp - 3 * width + tid], c
  // Recursive call
  // Allocate stack frame for 4 variables
  mad.u64 sp, -width, 4, sp
  @p br skip
  call f
skip:
  // Don't forget to restore width and tid!
  vote.count width
  vote.scan tid
  // Restore stack frame
  mad.u64 sp, width, 4, sp
  // Pop variables
  ld.global.b8 a, [sp - width + tid]
  ...

The problem is in saving and restoring ‘width’ and ‘tid’: it needs to be done at the same nesting level (so here I cannot put stack frame calculation inside the ‘skipped’ block). Storing them into the stack would likely defeat the purpose of saving space. But it might be doable with proper hardware support.

However we would lose stack alignment, so it is not a clear win…

Anyway, back to work…

There is something about using exotic features of GPUs (hardware transcendentals, texture filtering, attribute interpolation, static rounding modes…) in GPGPU apps. And something about taking advantage of regularity (coherence between threads) in data-parallel apps to make GPU-like architectures more efficient (like detecting redundant computations and storage).

Please do not ask me what the relation is. ;)

I think we have these already (well, not me, but lucky owners of Fermi-based devices):

vote.count:

	vote.ballot.b32  %r1, %p1;    // %p1 holds true in every active thread
	popc.b32         width, %r1;

vote.scan:

	vote.ballot.b32  %r1, %p1;
	and.b32          %r1, %r1, %lanemask_lt;
	popc.b32         tid, %r1;

although I’m not entirely sure vote.ballot works the way I think.
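For reference, the same idiom should be expressible in CUDA C on sm_2x using the __ballot() and __popc() intrinsics (a hedged sketch; countAndScan is a made-up name):

// Count the active lanes in the warp and compute this thread's rank among
// them, emulating the proposed vote.count / vote.scan with Fermi intrinsics.
__device__ void countAndScan( unsigned int& width, unsigned int& tid )
{
	unsigned int ballot = __ballot( 1 );            // one bit per active lane
	unsigned int lane   = threadIdx.x & 31;         // lane index within warp
	width = __popc( ballot );                       // vote.count
	tid   = __popc( ballot & ( ( 1u << lane ) - 1 ) ); // vote.scan
}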