Global Memory Coalescing: Read and Write Memory Coalescing

I have a kernel in which I would like both the reads and the writes to be coalesced, with both going to and from the same global memory array. However, I have been unable to do this unless my writes are coalesced from shared memory to an array in global memory that is different from the global memory array I coalesce my reads from.

Is there a way around this?

Also, is there a way to check if I am truly coalescing (besides keeping track of the numbers)? Is there a function, a macro, or a message I can check for?

Thanks,
tcullison

I’m not sure I follow what your problem is. Why can’t you coalesce when writing?

Reading/writing to the same array in global memory should not cause issues. For example, kernels that read, increment, write have no problems with coalescing.

Paulius

I’m pretty sure that I am able to get my reads and my writes to coalesce. However, if I coalesce a read from global memory into shared memory, I am unable to coalesce a write back to the same global memory array that I read from. I can, however, coalesce a write back to a different array in global memory. For example, in the transpose example, there is an *idata and an *odata. Is it possible to coalesce a read from *idata and, when I’m finished making calculations, coalesce a write back to *idata?

Another question I have is: besides performance increases, is there a way to verify if I’m coalescing reads or writes?

Thanks,
tcullison

Check out the SDK Transpose sample. It shows how to coalesce both reads and writes, by using shared memory.

A quick way to check for coalescing right now would be to run the kernel with the compute portion of the code commented out, so that only the reads/writes to global memory are performed. Measure the achieved bandwidth. If it is approaching the limit (about 80GB/s, I think), then you most likely have good coalescing. If it’s low (I’d say below 30-40GB/s), then you should double check coalescing. It’s not foolproof, but it will give you an idea.
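A minimal sketch of that bandwidth check, in case it helps. The names touch and measure_bandwidth_gbps are made up for this illustration, and the stride parameter is an addition of mine so that a deliberately uncoalesced run can be compared against the coalesced one. It assumes n is a multiple of stride. The kernel keeps a trivial add so the compiler does not discard the load and store.

```cuda
#include <cuda_runtime.h>

// Hypothetical memory-only kernel: one read and one write per thread, in place.
// stride == 1: neighbouring threads touch neighbouring elements (coalesced).
// stride  > 1: neighbouring threads touch elements `stride` apart.
// Assumes n is a multiple of stride, so the index mapping is a permutation.
__global__ void touch(float *data, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int rows = n / stride;
    int j = (i % rows) * stride + (i / rows);
    data[j] += 1.0f;
}

// Returns the achieved bandwidth in GB/s, or a negative value on error.
float measure_bandwidth_gbps(int n, int stride, int reps)
{
    if (n <= 0 || stride <= 0 || reps <= 0 || n % stride != 0) return -1.0f;

    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    touch<<<blocks, threads>>>(d, n, stride);   // warm-up launch, not timed

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        touch<<<blocks, threads>>>(d, n, stride);
    cudaEventRecord(stop);
    cudaError_t sync_err = cudaEventSynchronize(stop);
    cudaError_t launch_err = cudaGetLastError();

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);

    if (sync_err != cudaSuccess || launch_err != cudaSuccess || ms <= 0.0f)
        return -1.0f;

    // Each launch reads n floats and writes n floats.
    double bytes = 2.0 * n * sizeof(float) * reps;
    return (float)(bytes / (ms * 1.0e6));       // bytes per ms -> GB/s
}
```

Calling measure_bandwidth_gbps(1 << 22, 1, 20) and then measure_bandwidth_gbps(1 << 22, 16, 20) gives two numbers to compare. What counts as "close to the limit" depends on the card, so compare against the peak bandwidth of your own hardware rather than a fixed figure.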

Paulius

I have looked over the transpose example. My problem is I would like to coalesce my reads and writes from and to the same location in global memory.

The transpose example does not do this; it reads from and writes to different locations in global memory.

Is it a requirement when coalescing that any writes must be written to a different place in global memory than the location read from?

No, there’s no reason for such a requirement. For us, we only got up to 60GB/s at peak.

No, there is no such requirement. The transpose sample writes to a different area for result correctness - you don’t want to overwrite tile (x,y) before that tile itself has been written to its transposed location.
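To illustrate the case where the hazard does not arise, here is a rough sketch (not taken from the SDK sample) in which each block stages its own tile in shared memory and writes the result back to the same tile of the same array. The kernel reverse_tiles_in_place, the helper check_reverse_tiles, and the TILE size of 256 are all hypothetical choices for this example. Reversing each tile stands in for whatever calculation is done in shared memory.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

#define TILE 256   // threads per block and elements per tile (assumed value)

// Each block reads its tile with coalesced loads, reorders it in shared
// memory, and writes it back to the SAME global locations with coalesced
// stores. No block touches another block's tile, so nothing can be
// overwritten before it has been read.
__global__ void reverse_tiles_in_place(float *data, int n)
{
    __shared__ float tile[TILE];
    int t = threadIdx.x;
    int base = blockIdx.x * TILE;
    int i = base + t;

    if (i < n) tile[t] = data[i];          // coalesced read
    __syncthreads();                       // whole tile is now in shared memory

    int count = min(TILE, n - base);       // valid elements in this tile
    if (t < count)
        data[i] = tile[count - 1 - t];     // coalesced write, same array
}

// Host-side check: returns true if every tile came back reversed.
bool check_reverse_tiles(int n)
{
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    reverse_tiles_in_place<<<(n + TILE - 1) / TILE, TILE>>>(d, n);

    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        int base = (i / TILE) * TILE;
        int count = std::min(TILE, n - base);
        float expected = (float)(base + count - 1 - (i - base));
        if (h[i] != expected) return false;
    }
    return true;
}
```

The global accesses are data[base + t] for both the load and the store, so consecutive threads always touch consecutive addresses. Only the shared memory index is permuted.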

You can try the following kernel. Each thread reads a value, increments it, and writes back to the same location. You’ll get coalescing both times. I don’t know what kind of processing your code does, but if there’s no possibility (due to threadblock scheduling) that a value can be overwritten before it’s used, you should have no problem adopting the approach used in the Transpose sample.
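The kernel posted at this point is not preserved in the thread. A minimal sketch matching the description (each thread reads a value, increments it, and writes it back to the same location) might look like the following; the names increment and check_increment are assumptions for this illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Thread i reads data[i] and writes data[i]. Consecutive threads access
// consecutive addresses, so both the load and the store can coalesce.
__global__ void increment(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

// Host-side check: returns true if every element was incremented once.
bool check_increment(int n)
{
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    increment<<<(n + threads - 1) / threads, threads>>>(d, n);

    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h[i] != (float)(i + 1)) return false;
    return true;
}
```

Because every element is read and written only by its own thread, the order in which thread blocks are scheduled cannot change the result.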

Paulius

paulius, and yk_cadcg:

Thank you for your help. I have been able to coalesce both my reads and writes to the same location in global memory. I had been overlooking a hard-to-spot mistake in my code.

Also, thanks for the advice about testing for coalescing. I wasn’t sure of what throughput to expect.

Out of curiosity, did your times improve? If so, by how much?

Paulius

paulius:

I had about a 9x improvement in the total kernel execution time.