PTX function with a vector return type?

EDIT: Sorry, I realized this is the wrong forum and have made a similar post in the development forum instead. Moderators, feel free to delete this topic!

According to the PTX ISA 2.0 specification (p. 49), it is possible to declare a function (.func) whose return parameter has a vector type. However, when I try this, the system simply locks up.

If I make a function call and store the result in a two-component vector (.v2), the result comes out fine. However, if I pass this resulting vector to another function (even after copying it to a new register), the system crashes when that function tries to access the .y component of the vector.

The following code works. __cuda__cout_global is a module-scope register that holds the output address. The function fun simply returns its input.

Here a vector {4,0} is passed to fun, and the y-component of the returned vector is stored to the output address, which works fine.

.reg .b64 __cuda__cout_global;	// holds the output address

// Returns its input vector unchanged.
.func (.reg .v2 .b32 %rval) fun (.reg .v2 .b32 in){
	mov.v2.b32 %rval, in;
	ret;
}

.entry worker (
	.param .u64 __cudaparam__cin,
	.param .u64 __cudaparam__cout){
	.reg .v2 .b32 x;
	.reg .v2 .b32 %tmp<2>;
	ld.param.u64 __cuda__cin_global, [__cudaparam__cin];
	ld.param.u64 __cuda__cout_global, [__cudaparam__cout];
	mov.b32 %tmp0.x, 4;	// build the vector {4,0}
	mov.b32 %tmp0.y, 0;
	call (x), fun, (%tmp0);
	mov.v2.b32 %tmp1, x;
	st.global.b32 [__cuda__cout_global], x.y;	// store the y-component directly
}

In the following example, however, the only change is that I have moved the st.global.b32 instruction into a function, cout, which stores the y-component of its input to the global output address. With this extra function call, the system hangs.

.reg .b64 __cuda__cout_global;	// holds the output address

// Stores the y-component of its input to the output address.
.func () cout (.reg .v2 .b32 val){
	st.global.b32 [__cuda__cout_global], val.y;
}

// Returns its input vector unchanged.
.func (.reg .v2 .b32 %rval) fun (.reg .v2 .b32 in){
	mov.v2.b32 %rval, in;
	ret;
}

.entry worker (
	.param .u64 __cudaparam__cin,
	.param .u64 __cudaparam__cout){
	.reg .v2 .b32 x;
	.reg .v2 .b32 %tmp<2>;
	ld.param.u64 __cuda__cin_global, [__cudaparam__cin];
	ld.param.u64 __cuda__cout_global, [__cudaparam__cout];
	mov.b32 %tmp0.x, 4;	// build the vector {4,0}
	mov.b32 %tmp0.y, 0;
	call (x), fun, (%tmp0);
	mov.v2.b32 %tmp1, x;
	call cout, (%tmp1);	// the system hangs with this call
}
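
For comparison, here is a rough sketch of the same call pattern written in CUDA C rather than PTX. It is not taken from my actual code. The name cout_y, the explicit out pointer (used in place of the __cuda__cout_global register), and the host-side main are assumptions made only for illustration. Compiling it with nvcc -ptx for a target that supports real device function calls should show how the compiler itself passes a two-component vector between functions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Counterpart of "fun": returns its two-component input unchanged.
// __noinline__ asks the compiler to keep this as a real function call.
__device__ __noinline__ uint2 fun(uint2 in)
{
    return in;
}

// Counterpart of "cout" (hypothetical name cout_y): stores the
// y-component of its input to the output address.
__device__ __noinline__ void cout_y(unsigned int *out, uint2 val)
{
    *out = val.y;
}

// Counterpart of "worker": builds {4,0}, calls fun, then passes the
// result on to the second function.
__global__ void worker(unsigned int *out)
{
    uint2 x = fun(make_uint2(4u, 0u));
    cout_y(out, x);
}

int main()
{
    unsigned int *d_out = nullptr;
    unsigned int h_out = 0xFFFFFFFFu;  // sentinel, overwritten by the kernel

    if (cudaMalloc(&d_out, sizeof(unsigned int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    worker<<<1, 1>>>(d_out);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(d_out);
        return 1;
    }

    printf("y-component: %u\n", h_out);
    cudaFree(d_out);
    return 0;
}
```

If the generated PTX passes the vector differently from my hand-written version (for example through .param space instead of .reg), that difference may be a hint about what goes wrong above.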

Am I missing something obvious, or is there a bug in PTX 2.0?

Any help is much appreciated!