Problem with PTX function with vector input and return type

I previously posted this question on the general GPU computing forum, but realized that was probably the wrong place to ask the question, so I’m moving it here instead.

According to the PTX ISA 2.0 specification (p. 49), it is possible to declare a function (.func) whose return parameter has a vector type. However, when I try this, the system simply locks up.

If I call such a function and store the result in a two-component (.v2) vector, the result comes out fine. However, if I pass this resulting vector to another function (even after copying it to a new register), the system crashes when I try to access the .y component of the vector.

The following code works. __cuda__cout_global is a global variable that holds the output address, and the function fun simply returns its input.

Here a vector {4,0} is passed to fun, and the y-component of the returned vector is written to the output address, which works fine.

// Assumed declaration: used in worker below, but not shown in the original post.
.reg .b64 __cuda__cin_global;
.reg .b64 __cuda__cout_global;

// Returns its input vector unchanged.
.func (.reg .v2 .b32 %rval) fun (.reg .v2 .b32 in){
	mov.v2.b32 %rval, in;
	ret;
}

.entry worker (
	.param .u64 __cudaparam__cin,
	.param .u64 __cudaparam__cout){
	.reg .v2 .b32 x;
	.reg .v2 .b32 %tmp<2>;
	ld.param.u64 __cuda__cin_global, [__cudaparam__cin];
	ld.param.u64 __cuda__cout_global, [__cudaparam__cout];
	mov.b32 %tmp0.x, 4;
	mov.b32 %tmp0.y, 0;
	call (x), fun, (%tmp0);
	mov.v2.b32 %tmp1, x;
	// Store the y-component directly from the kernel: this works.
	st.global.b32 [__cuda__cout_global], x.y;
}

However, in the following example the st.global.b32 instruction is moved into a separate function, cout, which stores the y-component of its input at the global output address. With this extra function call the system hangs.

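// Assumed declaration: used in worker below, but not shown in the original post.
.reg .b64 __cuda__cin_global;
.reg .b64 __cuda__cout_global;

// Stores the y-component of its input at the global output address.
.func () cout (.reg .v2 .b32 val){
	st.global.b32 [__cuda__cout_global], val.y;
}

// Returns its input vector unchanged.
.func (.reg .v2 .b32 %rval) fun (.reg .v2 .b32 in){
	mov.v2.b32 %rval, in;
	ret;
}

.entry worker (
	.param .u64 __cudaparam__cin,
	.param .u64 __cudaparam__cout){
	.reg .v2 .b32 x;
	.reg .v2 .b32 %tmp<2>;
	ld.param.u64 __cuda__cin_global, [__cudaparam__cin];
	ld.param.u64 __cuda__cout_global, [__cudaparam__cout];
	mov.b32 %tmp0.x, 4;
	mov.b32 %tmp0.y, 0;
	call (x), fun, (%tmp0);
	mov.v2.b32 %tmp1, x;
	// Passing the returned vector on to a second function: this hangs.
	call cout, (%tmp1);
}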

The funny thing is that if I issue “call (x), fun, (%tmp0);” twice in a row, it works fine.

Am I missing something completely, or is there a bug in PTX 2.0?

Any help is much appreciated!

This, by the way, is on a GPU with compute capability 1.3. I thought that using vectors as parameters worked on 1.3, but maybe it doesn’t. I’m hoping to try it on a card with compute capability 2.0 tonight; hopefully that will work. Otherwise I can try placing the parameters in the .param state space instead (on compute capability 2.0, that is; 1.3 doesn’t support that).

Just to answer my own question: compute capability 1.3 apparently does not support using vectors as function parameters. Running the same program on a GPU with compute capability 2.0 worked.
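For anyone who wants to try the .param alternative mentioned above, here is a rough, untested sketch of how it might look on a compute capability 2.0 target. The names fun_param, worker_param, out_ptr, arg and res are made up for illustration and are not from my original code. It assumes a 64-bit build, and the .version/.target lines may need adjusting for your toolchain. The vector is passed as an 8-byte, 8-aligned byte array and unpacked with vector loads and stores, instead of being passed as a .v2 register.

```ptx
.version 2.0
.target sm_20

// Hypothetical .param-space variant of fun: returns its input unchanged.
.func (.param .align 8 .b8 rval[8]) fun_param (.param .align 8 .b8 src[8])
{
	.reg .b32 %a, %b;
	ld.param.v2.b32 {%a, %b}, [src];	// unpack both components
	st.param.v2.b32 [rval], {%a, %b};	// write them to the return parameter
	ret;
}

// Hypothetical kernel: out_ptr is assumed to be a 64-bit global address.
.entry worker_param (.param .u64 out_ptr)
{
	.reg .b64 %addr;
	.reg .b32 %x, %y, %rx, %ry;
	.param .align 8 .b8 arg[8];	// outgoing argument
	.param .align 8 .b8 res[8];	// incoming return value

	ld.param.u64 %addr, [out_ptr];
	mov.b32 %x, 4;
	mov.b32 %y, 0;

	st.param.v2.b32 [arg], {%x, %y};	// pack {4,0} into the argument
	call (res), fun_param, (arg);
	ld.param.v2.b32 {%rx, %ry}, [res];	// unpack the returned vector

	st.global.b32 [%addr], %ry;	// write the y-component to the output
	exit;
}
```

Since no vector register ever crosses a function boundary here, this should sidestep the .v2 register parameter issue, at the cost of the extra ld.param/st.param instructions around each call.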