Has anyone noticed that accessing memory by thread block is only semi-coalesced?
If I have thread number zero in each block access global memory, every few blocks fail to coalesce. For example, if I add the following code to the end of the template kernel in the SDK:
if (tid == 0) g_odata[bid] += bid;
with 128 threads in each of 32 blocks, block ids 8 and 24 decide to make ONE uncoalesced READ and TWO uncoalesced WRITES EACH for this statement.
I used the template project to recreate the same problem I’m having in a more complex kernel. For some reason, in that kernel the “deviant” block ids are 2, 10, 18, and 26 of 32 total blocks.
Any insight or correction at all would be much appreciated. Thanks!
Arrgh, my browser + forum software have eaten two successive attempts at a reply. I don’t have time to elaborate, but does this make sense?
if (tid == (bid & 16)) g_odata[bid] += bid;
You will get more coalesced action unless I’m mistaken, but I don’t know if performance is better.
Ah, thanks. I just used “if(tid==bid)” without the “& 16” and the uncoalesced action disappeared.
I don’t understand why it makes a difference, though. Before, I still had 32 threads active at once accessing a contiguous block of global memory. The scheduler must use the actual blockIdx.x or something?
again, thanks very much!
NP :) Just watch that this breaks down if you have more blocks than threads – you’ll get no writes from bids higher than the max tid. It’s safer/cleaner, IMO, to do the “& 16”.
Did it help performance? It makes a difference because each thread-processing hardware unit is “aligned” to memory in a certain way. Btw, a G200 card (with its relaxed coalescing rules) lets you mostly ignore all of this.
Is it even reasonable to expect accesses from different blocks to coalesce? The programming guide speaks of coalescing of multiple thread accesses within a block, specifically within a half-warp. But multiple blocks may not reside on the same multiprocessor, or if they do, I thought they were time-sliced. If they are on different multiprocessors, how do you know if they are exactly synchronized? Doesn’t __syncthreads() only synchronize within a block?
… what I’m interested in w.r.t performance is whether it’s any good to ‘align’ memory accesses that are still disparate.
No, accesses by threads from different threadblocks (or even half-warps, for that matter) are not coalesced. Memory coalescing is done at half-warp (16 consecutive threads) level. See the Programming Guide or SC08 slides 4-12 in the Optimizing CUDA section (www.gpgpu.org/sc2008).