Has anyone noticed that accessing memory by thread block is only semi-coalesced?
If I have thread number zero in each block access global memory, every few blocks fail to coalesce. For example, if I add the following code to the end of the template kernel in the SDK:
if (tid == 0) g_odata[bid] += bid;
with 128 threads in each of 32 blocks, block ids 8 and 24 decide to make ONE uncoalesced READ and TWO uncoalesced WRITES EACH for this statement.
I used the template project to recreate the same problem I’m having in a more complex kernel. For some reason, in that kernel the “deviant” block ids are 2, 10, 18, and 26 of 32 total blocks.
Any insight or correction at all would be much appreciated. Thanks!
Arrgh, my browser + forum software have eaten two successive attempts at a reply. I don’t have time to elaborate, but does this make sense?
if (tid == (bid & 16)) g_odata[bid] += bid;
You will get more coalesced action unless I’m mistaken, but I don’t know if performance is better.
Ah, thanks. I just used “if(tid==bid)” without the “& 16” and the uncoalesced action disappeared.
I don’t understand why it makes a difference, though. Before, I still had 32 threads active at once accessing a contiguous block of global memory. The scheduler must use the actual blockIdx.x or something?
again, thanks very much!
NP :) Just watch that this breaks down if you have more blocks than threads – you’ll get no writes from bids higher than the max tid. It’s safer/cleaner, IMO, to do the “& 16”.
Did it help performance? It makes a difference because each thread-processing hardware unit is “aligned” to memory in a certain way. Btw, a G200 card (with its relaxed coalescing rules) lets you mostly ignore all of this.
Is it even reasonable to expect accesses from different blocks to coalesce? The programming guide speaks of coalescing of multiple thread accesses within a block, specifically within a half-warp. But multiple blocks may not reside on the same multiprocessor, or if they do, I thought they were time-sliced. If they are on different multiprocessors, how do you know if they are exactly synchronized? Doesn’t __syncthreads() only synchronize within a block?
… what I’m interested in w.r.t performance is whether it’s any good to ‘align’ memory accesses that are still disparate.
No, accesses by threads from different threadblocks (or even half-warps, for that matter) are not coalesced. Memory coalescing is done at half-warp (16 consecutive threads) level. See the Programming Guide or SC08 slides 4-12 in the Optimizing CUDA section (www.gpgpu.org/sc2008).