All blocks read the same global memory section. What is the fastest method?

Hi! My algorithm has all blocks read the same section of global memory. What is the fastest way to do this? I guess there is no broadcast here? Because the more blocks I have, the longer it takes!

Thank you!!!

How much data is being read in this fashion? How often does it change during a single kernel invocation? Currently, is this global memory data accessed using the most advantageous access pattern (base + thread_index)? Currently, how many threads are in each of the thread blocks?


Thank you very much for your attention!!!
I have 64 threads in one block, and I might allocate blockDim.x=10, blockDim.y=1.
For each block, about 8*32 floating-point values need to be read (maybe a few more, but not many).
And here is how they are read:
In Block.x=0:
th0 reads data0, th1 reads data1 … up to th63, which reads data63, and then they put the values into registers for later calculation.

In Block.x=1:
Same thing!! Notice that th0 also reads data0! The same data0!!!
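
To make the access pattern above concrete, here is a minimal sketch of what it might look like in code. The kernel name `sum_same_section`, the helper `run_same_section`, the final summation, and the choice of 4 strided loads per thread (8*32 values / 64 threads) are all assumptions for illustration; only the 64 threads per block, the 8*32 values, and "thread t reads element t" come from the post.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads   = 64;                   // threads per block (from the post)
constexpr int kSection   = 8 * 32;               // floats every block reads (from the post)
constexpr int kPerThread = kSection / kThreads;  // 4 loads per thread (assumption)

// Hypothetical kernel: every block loads the SAME kSection floats.
// Consecutive threads read consecutive addresses (base + thread index),
// so each warp's load is coalesced.
__global__ void sum_same_section(const float* __restrict__ section, float* out)
{
    float regs[kPerThread];
    for (int i = 0; i < kPerThread; ++i) {
        regs[i] = section[i * kThreads + threadIdx.x];
    }

    // Placeholder for the "later calculation": just add the values up.
    float acc = 0.0f;
    for (int i = 0; i < kPerThread; ++i) {
        acc += regs[i];
    }
    out[blockIdx.x * kThreads + threadIdx.x] = acc;
}

// Hypothetical host helper: fills the section with 0, 1, 2, ... and
// returns one result per thread for num_blocks blocks.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_same_section(int num_blocks)
{
    std::vector<float> h_in(kSection);
    for (int i = 0; i < kSection; ++i) {
        h_in[i] = static_cast<float>(i);
    }

    const size_t out_count = static_cast<size_t>(num_blocks) * kThreads;
    float* d_in  = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, kSection * sizeof(float));
    cudaMalloc(&d_out, out_count * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), kSection * sizeof(float), cudaMemcpyHostToDevice);

    sum_same_section<<<num_blocks, kThreads>>>(d_in, d_out);

    std::vector<float> h_out(out_count);
    cudaMemcpy(h_out.data(), d_out, out_count * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With this test input, thread t of every block ends up with 384 + 4t, which is an easy way to check that all blocks really do see the same data.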

So I looked at constant memory; I guess that is also what you had in mind when you asked how much data there is. But I found out that constant memory is fast only(?) when all threads in one warp read the same value… And I guess the L2 cache for global memory will help? That is, blocks 0, 1, 2, 3 may run at the same time and their reads will compete, which cannot be avoided. But the values will be cached, so when a later block such as block 4 reads the same data section, the cached data can be reused very quickly!
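
The two options discussed here can be put side by side in a small sketch. The names `c_section`, `scale_from_constant`, `scale_from_global`, and `constant_and_global_agree` are made up for illustration, and the multiply by 2 just stands in for the real calculation. In the first kernel each thread of a warp reads a different constant-memory address, which is the case where the constant cache serializes the accesses. The second kernel reads ordinary global memory through the read-only path with `__ldg()` (available on compute capability 3.5 and later), so repeated reads by later blocks can be served from cache. This only shows that both paths give the same values; which one is faster would have to be measured on the actual GPU.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 64;      // threads per block (from the post)
constexpr int kCount = 8 * 32;  // size of the shared section (from the post)

// Hypothetical constant-memory copy of the section.
__constant__ float c_section[kCount];

// Each thread in a warp reads a DIFFERENT constant address here, so the
// constant cache cannot broadcast and the accesses are serialized.
__global__ void scale_from_constant(float* out)
{
    const int t = threadIdx.x;
    out[blockIdx.x * blockDim.x + t] = c_section[t] * 2.0f;
}

// Same work, but reading global memory through the read-only data path.
// The loads are coalesced and can be cached for other blocks.
__global__ void scale_from_global(const float* __restrict__ section, float* out)
{
    const int t = threadIdx.x;
    out[blockIdx.x * blockDim.x + t] = __ldg(&section[t]) * 2.0f;
}

// Hypothetical host helper: runs both kernels and checks that they agree
// with each other and with the host-side expectation.
// Error checking is omitted to keep the sketch short.
bool constant_and_global_agree(int num_blocks)
{
    std::vector<float> h(kCount);
    for (int i = 0; i < kCount; ++i) {
        h[i] = 0.5f * static_cast<float>(i);
    }

    const size_t n = static_cast<size_t>(num_blocks) * kBlock;
    float* d_section = nullptr;
    float* d_out_c   = nullptr;
    float* d_out_g   = nullptr;
    cudaMalloc(&d_section, kCount * sizeof(float));
    cudaMalloc(&d_out_c, n * sizeof(float));
    cudaMalloc(&d_out_g, n * sizeof(float));
    cudaMemcpy(d_section, h.data(), kCount * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_section, h.data(), kCount * sizeof(float));

    scale_from_constant<<<num_blocks, kBlock>>>(d_out_c);
    scale_from_global<<<num_blocks, kBlock>>>(d_section, d_out_g);

    std::vector<float> out_c(n), out_g(n);
    cudaMemcpy(out_c.data(), d_out_c, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(out_g.data(), d_out_g, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_section);
    cudaFree(d_out_c);
    cudaFree(d_out_g);

    bool ok = true;
    for (size_t i = 0; i < n; ++i) {
        const float expected = h[i % kBlock] * 2.0f;
        ok = ok && (out_c[i] == expected) && (out_g[i] == expected);
    }
    return ok;
}
```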

Thank you!!! What do you think??

haha, that is also me. Thank you very much for that~
