I’m actually glad that someone came across this again. My friend posted this as a potential bug about 2 months ago and didn’t get much response. The problem manifests itself only with variables declared with the __device__ specifier, in which case a conditional store to the variable is sometimes converted into an unconditional load, conditional select, and store. This means that all threads will load from and store to the variable, not only those for which the condition threadIdx.x == 0 holds, which introduces a race condition.
In your example the resulting PTX for the second kernel is as follows:
.entry _Z9theAnswerPii (
.param .u32 __cudaparm__Z9theAnswerPii_result,
.param .s32 __cudaparm__Z9theAnswerPii_param)
{
.reg .u32 %r<9>;
.reg .pred %p<3>;
.loc 16 9 0
$LBB1__Z9theAnswerPii:
cvt.u32.u16 %r1, %tid.x;
mov.u32 %r2, 0;
setp.eq.u32 %p1, %r1, %r2;
ld.param.s32 %r3, [__cudaparm__Z9theAnswerPii_param];
ld.global.s32 %r4, [gval];
selp.s32 %r5, %r3, %r4, %p1;
st.global.s32 [gval], %r5;
.loc 16 12 0
bar.sync 0;
@!%p1 bra $Lt_1_1794;
.loc 16 14 0
ld.global.s32 %r6, [gval];
ld.param.u32 %r7, [__cudaparm__Z9theAnswerPii_result];
st.global.s32 [%r7+0], %r6;
$Lt_1_1794:
.loc 16 15 0
exit;
$LDWend__Z9theAnswerPii:
} // _Z9theAnswerPii
The following code section is executed unconditionally, by all threads. Only thread 0 stores the new value, but every other thread stores back the old value it just loaded, and any of those stores can land after thread 0's store and overwrite it.
ld.global.s32 %r4, [gval];
selp.s32 %r5, %r3, %r4, %p1;
st.global.s32 [gval], %r5;
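In source terms, the three instructions above behave roughly as if the kernel had been written like the sketch below. This is a hand-written reconstruction from the PTX, not the original source: the kernel name theAnswerAsCompiled and the local names isFirst and old are made up for illustration, and the parameter names are guessed from the mangled PTX parameter names.

```cuda
// Hypothetical source-level equivalent of the PTX the compiler emitted.
// gval is the __device__ variable referenced as [gval] in the PTX.
__device__ int gval;

__global__ void theAnswerAsCompiled(int *result, int param)
{
    bool isFirst = (threadIdx.x == 0);  // setp.eq.u32 %p1, %r1, %r2
    int old = gval;                     // ld.global.s32 %r4, [gval]   (every thread)

    // selp.s32 + st.global.s32: every thread writes gval.
    // Thread 0 writes param; all other threads write back the value they read,
    // which may be the stale value from before thread 0's store.
    gval = isFirst ? param : old;

    __syncthreads();                    // bar.sync 0

    if (isFirst) {
        *result = gval;                 // may observe the stale value
    }
}
```

Written this way, the read-modify-write by the threads other than thread 0 is visible directly, which is where the race comes from.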
Note that declaring the variable volatile makes the compiler wrap the store in a branch, so that there is no race condition. The new section becomes:
setp.eq.u32 %p1, %r1, %r2;
@!%p1 bra $Lt_1_1794;
ld.param.s32 %r3, [__cudaparm__Z9theAnswerPii_param];
st.volatile.global.s32 [gval], %r3;
$Lt_1_1794:
Note the branch instruction ‘@!%p1 bra $Lt_1_1794;’ replaces the conditional select ‘selp.s32 %r5, %r3, %r4, %p1;’
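For reference, here is a minimal sketch of the volatile workaround as a complete program. The kernel body is my reconstruction from the PTX rather than the original source, and the launch configuration (1 block of 64 threads) and the value 42 are arbitrary choices for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// volatile prevents the compiler from turning the guarded store
// into an unconditional load/select/store sequence.
__device__ volatile int gval;

// Reconstructed kernel (assumption): thread 0 publishes param through gval,
// then copies it to result after the barrier.
__global__ void theAnswer(int *result, int param)
{
    if (threadIdx.x == 0) {
        gval = param;       // only thread 0 stores
    }
    __syncthreads();
    if (threadIdx.x == 0) {
        *result = gval;
    }
}

int main()
{
    int *d_result = NULL;
    int h_result = 0;

    if (cudaMalloc((void **)&d_result, sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    theAnswer<<<1, 64>>>(d_result, 42);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("result = %d\n", h_result);

    cudaFree(d_result);
    return 0;
}
```

You can check that the workaround took effect by compiling with nvcc -ptx and looking for the branch around the st.volatile.global store instead of the selp.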