This is a bit verbose, to keep the problem separate from its context.
Say I use 64 threads to write 64 ints from shared memory to global memory.
The ints are coalesced in shared memory (int s_src[64]), but each one must go to a unique yet essentially random target position in global memory. For example, with a map 0->23, 1->41, …, thread 0 writes s_src[0] to d_dst[23], thread 1 writes s_src[1] to d_dst[41], and so on.
This randomly mapped write is slow when the dataset is large.
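To make the pattern concrete, here is a minimal sketch of the single-pass scattered write, assuming a precomputed permutation table d_map in global memory (all names here are hypothetical, not from the original post):

```cuda
// Single-pass scattered write (hypothetical names).
// Each thread loads one element into shared memory, then stores it to a
// permuted position in global memory given by d_map.
__global__ void scatterWrite(const int *d_in, const int *d_map, int *d_dst)
{
    __shared__ int s_src[64];
    int tx = threadIdx.x;

    s_src[tx] = d_in[tx];        // coalesced load into shared memory
    __syncthreads();

    // Scattered store: adjacent threads in a warp hit unrelated
    // addresses, so one warp's 32 stores can touch up to 32 distinct
    // memory segments instead of one coalesced transaction.
    d_dst[d_map[tx]] = s_src[tx];
}
```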
However, if I split it into passes:
pass 1: for thread tx, if s_src[tx] maps to positions 0~7 of d_dst, actually write it to d_dst; otherwise skip.
pass 2: for thread tx, if s_src[tx] maps to positions 8~15 of d_dst, actually write it to d_dst; otherwise skip.
… and so on for the remaining 8-element ranges.
Then this multi-pass, smaller-range random mapping is faster.
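The multi-pass variant could be sketched like this (again with hypothetical names; the 8-element window size matches the ranges above):

```cuda
// Multi-pass scattered write (hypothetical names). Each pass performs
// only the stores whose targets fall inside one 8-element window of
// d_dst; the other threads skip their store in that pass.
__global__ void scatterWriteMultiPass(const int *d_in, const int *d_map,
                                      int *d_dst)
{
    __shared__ int s_src[64];
    int tx = threadIdx.x;

    s_src[tx] = d_in[tx];        // coalesced load into shared memory
    __syncthreads();

    int target = d_map[tx];
    for (int base = 0; base < 64; base += 8) {
        // Only threads mapping into [base, base + 8) write this pass.
        if (target >= base && target < base + 8)
            d_dst[target] = s_src[tx];
        __syncthreads();         // not required for correctness; keeps
                                 // the passes distinct in time
    }
}
```

Note the loop bounds are uniform across the block, so the __syncthreads() inside the loop is reached by every thread.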
I can’t explain this with any coalescing knowledge I have. Why is that?
Thanks a lot!