This is a bit verbose, to keep the problem separate from its context.
Say I use 64 threads to write 64 ints from shared memory to global memory.
The ints are coalesced in shared memory (int s_src[64]), but each one must go to a unique yet essentially random target position in global memory. For example, with a map 0->23, 1->41, …, thread 0 writes s_src[0] to d_dst[23], thread 1 writes s_src[1] to d_dst[41], and so on.
This randomly mapped write is slow when the dataset is large.
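To make the pattern concrete, here is a minimal sketch of the single-pass scattered write, assuming a precomputed permutation table d_map in global memory (all names here are hypothetical, not from the original post):

```cuda
// Single-pass scattered write (hypothetical names).
// Each thread loads one element into shared memory, then stores it to a
// permuted position in global memory given by d_map.
__global__ void scatterWrite(const int *d_in, const int *d_map, int *d_dst)
{
    __shared__ int s_src[64];
    int tx = threadIdx.x;

    s_src[tx] = d_in[tx];        // coalesced load into shared memory
    __syncthreads();

    // Scattered store: adjacent threads in a warp hit unrelated
    // addresses, so one warp's 32 stores can touch up to 32 distinct
    // memory segments instead of one coalesced transaction.
    d_dst[d_map[tx]] = s_src[tx];
}
```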
However, if I split it into passes:
pass 1: for thread tx, if s_src[tx] maps to positions 0~7 of d_dst, actually write it to d_dst; otherwise skip.
pass 2: for thread tx, if s_src[tx] maps to positions 8~15 of d_dst, actually write it to d_dst; otherwise skip.
… and so on for the remaining 8-element ranges.
Then this multi-pass, smaller-range random mapping is faster.
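The multi-pass variant could be sketched like this (again with hypothetical names; the 8-element window size matches the ranges above):

```cuda
// Multi-pass scattered write (hypothetical names). Each pass performs
// only the stores whose targets fall inside one 8-element window of
// d_dst; the other threads skip their store in that pass.
__global__ void scatterWriteMultiPass(const int *d_in, const int *d_map,
                                      int *d_dst)
{
    __shared__ int s_src[64];
    int tx = threadIdx.x;

    s_src[tx] = d_in[tx];        // coalesced load into shared memory
    __syncthreads();

    int target = d_map[tx];
    for (int base = 0; base < 64; base += 8) {
        // Only threads mapping into [base, base + 8) write this pass.
        if (target >= base && target < base + 8)
            d_dst[target] = s_src[tx];
        __syncthreads();         // not required for correctness; keeps
                                 // the passes distinct in time
    }
}
```

Note the loop bounds are uniform across the block, so the __syncthreads() inside the loop is reached by every thread.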
I can’t explain this with any coalescing knowledge I have. Why is that?
Thanks a lot!