Sorry to be a pain Cyril, but are you sure you are correct? My offhand logic said the same, concurrent writes would look like bank conflicts at first glance, however, it seems that the current hardware does not work this way and that the compiler has been designed to take advantage of the fact. When I compile this program:
__shared__ int j, k;
__global__ void kernel()
{
if (threadIdx.x == 2) j = 1;
k = threadIdx.x == 2 ? 1 : k;
}
I get to following code:
.entry kernel
{
.reg .u32 $r1,$r2,$r3,$r4,$r5,$r6,$r7,$r8,$r9;
.reg .pred $p0,$p1,$p2;
# .loc 10 3 0
# 1 __shared__ int j, k;
# 2
# 3 __global__ void kernel()
$LBB1_kernel:
# .loc 10 5 0
# 4 {
# 5 if (threadIdx.x == 2) j = 1;
cvt.u32.u16 $r1, %tid.x; #
mov.s32 $r2, 1; #
ld.shared.s32 $r3, j; # id:9 j+0x0
mov.u32 $r4, 2; #
setp.eq.u32 $p1, $r1, $r4; #
selp.s32 $r5, $r2, $r3, $p1; #
st.shared.s32 j, $r5; # id:9 j+0x0
# .loc 10 7 0
# 6
# 7 k = threadIdx.x == 2 ? 1 : k;
ld.shared.s32 $r6, k; # id:10 k+0x0
mov.s32 $r7, 1; #
mov.u32 $r8, 2; #
setp.ne.u32 $p2, $r1, $r8; #
selp.s32 $r9, $r6, $r7, $p2; #
st.shared.s32 k, $r9; # id:10 k+0x0
exit; #
} # kernel
.version 1.1
Careful examination of both code fragments shows that they are identical, one writing shared variable j and the other k. Now the first one writes j to 1 only in thread 2 and the second writes k to 1 in thread 2 and to its original value in all other threads. Both these fragments will have exactly the same effect when run, so one can deduce that they are equivalent expressions, so in warp 0 (which contains thread 2) a write always succeeds for a thread with a different value to what is already stored in the target location (at least if there is only 1 different one). The important thing is that the writing of j (st.shared.s32 instruction) must be concurrent, not serialised, otherwise providing these branch around shared memory writes would NOT give you any performance benefit, and there would always be a 16 way bank conflict, and so likewise the writing of k must be concurrent, not serialised. Why would you change the hardware spec in the future to something sloppier? All stores to shared memory not indexed by tid generate 16 way bank conflicts - seems very unlikely. You would not design hardware to serialise writes when you know that only one is going to succeed in any case. Currently there must be some logic to select the different value to succeed.
Thanks, Eric