Random mapping faster when the target range is smaller (memory bandwidth issue?)

This is a bit verbose because I'm separating the problem from its original context.
Say I use 64 threads to write 64 ints from shared memory to global memory.
The ints are contiguous in shared memory (int s_src[64]), but each one has to be written to a unique but random target position in global memory. For example, if the map is 0->23, 1->41, …, then thread 0 writes s_src[0] to d_dst[23], thread 1 writes s_src[1] to d_dst[41], and so on.
This randomly mapped write is slow when the dataset is huge.
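In CUDA-style code, the single-pass version is roughly the following (just a sketch: d_map is a hypothetical device array holding the random target indices, and I assume one block of 64 threads does the work):

__global__ void scatterAll(const int *d_in, int *d_dst, const int *d_map)
{
    __shared__ int s_src[64];
    int tx = threadIdx.x;            // one block of 64 threads
    s_src[tx] = d_in[tx];            // contiguous, coalesced load into shared memory
    int pos = d_map[tx];             // unique but random destination index
    d_dst[pos] = s_src[tx];          // scattered write to global memory
}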
However, if I instead use:
kernel0:
for thread tx, write s_src[tx] to d_dst only if it maps to positions 0~7 of d_dst; otherwise skip.
kernel1:
for thread tx, write s_src[tx] to d_dst only if it maps to positions 8~15 of d_dst; otherwise skip.
…and so on for the remaining 8-element ranges,

This multi-pass, smaller-range random mapping is faster.
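A sketch of the multi-pass version, under the same assumptions (rangeStart, rangeEnd, and the host loop are only illustrative names, not my exact code):

__global__ void scatterRange(const int *d_in, int *d_dst, const int *d_map,
                             int rangeStart, int rangeEnd)
{
    __shared__ int s_src[64];
    int tx = threadIdx.x;
    s_src[tx] = d_in[tx];                      // same coalesced load as before
    int pos = d_map[tx];
    if (pos >= rangeStart && pos < rangeEnd)   // only targets inside this pass's window
        d_dst[pos] = s_src[tx];                // otherwise skip the write
}

// host side: one kernel launch per window of 8 destination slots
for (int start = 0; start < 64; start += 8)
    scatterRange<<<1, 64>>>(d_in, d_dst, d_map, start, start + 8);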
I can't explain this with any coalescing knowledge I have. Why is that?
Thanks a lot!

Could anyone give a guess? Thanks a lot!

Seems like no one has experienced this phenomenon… :(

My own guess is: when the range of write locations is large, spatial locality is low and hardly any writes within a warp can be coalesced; with multi-pass writes over smaller ranges, maybe some of the writes do get coalesced.
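As a rough back-of-the-envelope example (assuming the hardware serves global writes in 32-byte chunks, which may not match every GPU): 64 scattered 4-byte ints can touch up to 64 different chunks, i.e. up to 64 × 32 B = 2 KB of traffic for only 256 B of useful data, whereas all the writes of one pass land inside a single 8-int (32 B) window and can share one chunk.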

Am I right?..