EDIT: Sorry, I realized this is the wrong forum and have made a similar post in the development forum instead. Moderators, feel free to delete this topic!
According to the PTX ISA 2.0 specification (p. 49), it is possible to create a function (.func) whose return parameter has a vector base type. However, I'm having a problem with this: the system simply locks up.
If I make a function call and store the result in a two-element (.v2) vector, the result comes out fine. However, if I pass the resulting vector (even after copying it to a new register) to another function, the system crashes when I try to access the .y component of the vector.
The following code works. __cuda__cout_global is a global variable that holds an output address, and the function fun simply returns its input.
Here a vector {4,0} is passed to fun, and the y component of the returned vector is written to the output, which works fine.
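.reg .b64 __cuda__cin_global;   // declaration missing from the original snippet
.reg .b64 __cuda__cout_global;

// Identity function: returns its two-element vector input unchanged.
.func (.reg .v2 .b32 %rval) fun (.reg .v2 .b32 in){
    mov.v2.b32 %rval, in;
    ret;
}

.entry worker (
    .param .u64 __cudaparam__cin,
    .param .u64 __cudaparam__cout){
    .reg .v2 .b32 x;
    .reg .v2 .b32 %tmp<2>;
    ld.param.u64 __cuda__cin_global, [__cudaparam__cin];
    ld.param.u64 __cuda__cout_global, [__cudaparam__cout];
    // Build the vector {4, 0}.
    mov.b32 %tmp0.x, 4;
    mov.b32 %tmp0.y, 0;
    call (x), fun, (%tmp0);
    mov.v2.b32 %tmp1, x;
    // Write the y component of the result to the output address.
    st.global.b32 [__cuda__cout_global], x.y;
}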
However, in the following example I add a call to a second function, cout, which simply stores the y component of its input at the global output address. In other words, I have only moved the st.global.b32 instruction into cout. In this case the system hangs.
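.reg .b64 __cuda__cin_global;   // declaration missing from the original snippet
.reg .b64 __cuda__cout_global;

// Stores the y component of its argument at the global output address.
.func () cout (.reg .v2 .b32 val){
    st.global.b32 [__cuda__cout_global], val.y;
}

// Identity function: returns its two-element vector input unchanged.
.func (.reg .v2 .b32 %rval) fun (.reg .v2 .b32 in){
    mov.v2.b32 %rval, in;
    ret;
}

.entry worker (
    .param .u64 __cudaparam__cin,
    .param .u64 __cudaparam__cout){
    .reg .v2 .b32 x;
    .reg .v2 .b32 %tmp<2>;
    ld.param.u64 __cuda__cin_global, [__cudaparam__cin];
    ld.param.u64 __cuda__cout_global, [__cudaparam__cout];
    // Build the vector {4, 0}.
    mov.b32 %tmp0.x, 4;
    mov.b32 %tmp0.y, 0;
    call (x), fun, (%tmp0);
    // Copy the result to a fresh register, then pass the copy to cout.
    mov.v2.b32 %tmp1, x;
    call cout, (%tmp1);
}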
Am I missing something completely, or is there a bug in PTX 2.0?
Any help is much appreciated!