Global Memory Coalescing: Read and Write Memory Coalescing

tcullison · July 23, 2007, 7:39pm

I have a kernel in which I would like to perform both coalesced reads and coalesced writes to and from the same global memory array. However, I have been unable to do this: my writes from shared memory only coalesce when they go to a global memory array that is different from the one I coalesce my reads from.

Is there a way around this?

Also, is there a way to check whether I am truly coalescing (besides keeping track of the numbers)? Is there a function, a macro, or a message I can check for?

Thanks,
tcullison

paulius · July 23, 2007, 9:27pm

I’m not sure I follow what your problem is. Why can’t you coalesce when writing?

Reading/writing to the same array in global memory should not cause issues. For example, kernels that read, increment, write have no problems with coalescing.

Paulius

tcullison · July 24, 2007, 1:50pm

I’m pretty sure that I am able to get my reads and my writes to coalesce. However, if I coalesce a read from global memory into shared memory, I am unable to coalesce a write back to the same global memory array that I read from. I can, however, coalesce a write back to a different array in global memory. For example, in the transpose example, there is an *idata and an *odata. Is it possible to coalesce a read from *idata and, when I’m finished making calculations, coalesce a write back to *idata?

Another question I have is: besides performance increases, is there a way to verify if I’m coalescing reads or writes?

Thanks,
tcullison

paulius · July 24, 2007, 6:48pm

Check out the SDK Transpose sample. It shows how to coalesce both reads and writes, by using shared memory.

A quick way to check for coalescing right now would be to run the kernel with the compute portion of the code commented out, so that only the reads/writes to global memory are performed. Measure the achieved bandwidth. If it is approaching the limit (about 80GB/s, I think), then you most likely have good coalescing. If it’s low (I’d say below 30-40GB/s), then you should double-check coalescing. It’s not foolproof, but it will give you an idea.
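As a rough sketch of that bandwidth check, the program below times a kernel that does almost nothing except read and write global memory, once with neighbouring threads touching neighbouring addresses and once with a stride between them. It assumes the CUDA runtime API and its event timers; the names `touchStrided` and `measureGBps`, the array size, and the stride of 32 are illustrative choices, not anything from this thread, and the numbers it prints depend entirely on the GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread reads one float and writes it back to the same location.
// stride == 1: neighbouring threads touch neighbouring addresses.
// stride  > 1: neighbouring threads are `stride` elements apart.
// Assumes n is a multiple of stride, so every element is touched exactly once.
__global__ void touchStrided(float *data, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int m = n / stride;
        int j = (i % m) * stride + (i / m);
        data[j] += 1.0f;  // almost no compute: one read, one write
    }
}

// Returns the achieved bandwidth in GB/s, or a negative value on error.
static double measureGBps(float *d_data, int n, int stride)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    touchStrided<<<blocks, threads>>>(d_data, n, stride);  // warm-up launch

    cudaEventRecord(start);
    touchStrided<<<blocks, threads>>>(d_data, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    if (cudaGetLastError() != cudaSuccess || ms <= 0.0f)
        return -1.0;

    // Every element is read once and written once.
    double bytesMoved = 2.0 * n * sizeof(float);
    return bytesMoved / (ms * 1.0e6);  // bytes per millisecond -> GB/s
}

int main()
{
    const int n = 1 << 22;  // 4M floats (16 MB), a multiple of 32
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    double unitStride = measureGBps(d_data, n, 1);
    double wideStride = measureGBps(d_data, n, 32);
    printf("stride  1: %.1f GB/s\n", unitStride);
    printf("stride 32: %.1f GB/s\n", wideStride);

    cudaFree(d_data);
    return (unitStride > 0.0 && wideStride > 0.0) ? 0 : 1;
}
```

Comparing the two figures on the same card gives a baseline for what "low" looks like, which may be more useful than an absolute threshold.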

Paulius

tcullison · July 24, 2007, 11:31pm

I have looked over the transpose example. My problem is I would like to coalesce my reads and writes from and to the same location in global memory.

The transpose example does not do this; it reads and writes to different locations in global memory.

Is it a requirement when coalescing that any writes must be written to a different place in global memory than the location read from?

yk_cadcg · July 25, 2007, 8:19am

No, there’s no reason for such a requirement. As for bandwidth, we only got up to 60GB/s at peak.

paulius · July 25, 2007, 9:57pm

No, there is no such requirement. The transpose sample writes to a different area for result correctness - you don’t want to overwrite tile (x,y) before that tile itself has been written to its transposed location.

You can try the following kernel. Each thread reads a value, increments it, and writes back to the same location. You’ll get coalescing both times. I don’t know what kind of processing your code does, but if there’s no possibility (due to threadblock scheduling) that a value can be overwritten before it’s used, you should have no problem adopting the approach used in the Transpose sample.
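A minimal sketch of the kind of read-increment-write kernel described here, written under assumptions: it is not the original code from this post, and the name `incrementInPlace`, the problem size, and the block size are illustrative. Thread i touches only element i, so consecutive threads access consecutive addresses on both the read and the write, and no thread depends on a value owned by another thread.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Thread i reads element i, increments it, and writes it back in place.
__global__ void incrementInPlace(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];   // read: consecutive threads, consecutive addresses
        data[i] = v + 1.0f;  // write: same addresses, same access pattern
    }
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    std::vector<float> h(n);
    for (int i = 0; i < n; ++i)
        h[i] = static_cast<float>(i);

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_data, h.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    incrementInPlace<<<blocks, threads>>>(d_data, n);

    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Every element should now hold its original value plus one.
    for (int i = 0; i < n; ++i) {
        if (h[i] != static_cast<float>(i) + 1.0f) {
            fprintf(stderr, "mismatch at element %d\n", i);
            return 1;
        }
    }
    printf("all %d elements incremented in place\n", n);
    return 0;
}
```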

Paulius

tcullison · July 26, 2007, 1:37pm

paulius, and yk_cadcg:

Thank you for your help. I have been able to coalesce both my reads and writes to the same location in global memory. I had been overlooking a hard-to-spot mistake in my code.

Also, thanks for the advice about testing for coalescing. I wasn’t sure of what throughput to expect.

paulius · July 26, 2007, 6:11pm

Out of curiosity, did your times improve? If so, by how much?

Paulius

tcullison · July 31, 2007, 1:43pm

paulius:

I had about a 9x improvement in the total kernel execution time.
