How bad are non-coalesced STORES to gl. mem?

Fuchs · August 14, 2008, 7:18pm

I’m about to develop a kernel in which I need to permute the result vector, which resides in shared memory, before writing it back to global memory. (I’m implementing a sequency-ordered Walsh transform by extending the SDK’s Hadamard-ordered Walsh transform.)

I’m aware that reads from global memory should always be coalesced, since the program cannot continue without the value and has to wait for it. When writing a value to global memory, though, it would make sense to me if the program just continued without waiting (400-600 cycles?) for the value to be written, provided that there are no data dependencies (i.e. reads from the same global memory).

I also noticed that some SDK examples don’t write to global memory coalesced but focus on coalesced reads.

Basically, what I need to decide is this: given the Hadamard sequence, I need to permute it, e.g. for 8 values like this:

Hadamard: 1 2 3 4 5 6 7 8

Walsh   : 1 8 4 5 2 7 3 6
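
[Editor's sketch] The example row above can be reproduced with a small index function. This is a minimal sketch, assuming the permutation is the usual bit-reversal followed by Gray-to-binary decoding that relates Hadamard (natural) order to sequency order; `walshDest` is a hypothetical name, not something from the SDK sample. It returns the 0-based position in the Walsh-ordered vector where Hadamard element `h` ends up.

```cuda
// Hypothetical helper: destination index (0-based) in sequency (Walsh)
// order for element h of a Hadamard-ordered vector of length 2^log2n.
// Assumption: destination = grayToBinary(bitReverse(h)).
__host__ __device__ inline unsigned walshDest(unsigned h, unsigned log2n)
{
    // Reverse the lowest log2n bits of h.
    unsigned r = 0;
    for (unsigned b = 0; b < log2n; ++b)
        r |= ((h >> b) & 1u) << (log2n - 1 - b);

    // Gray-to-binary decode: each output bit is the XOR of that bit
    // and all higher bits of r (computed with doubling shifts).
    unsigned w = r;
    for (unsigned s = 1; s < log2n; s <<= 1)
        w ^= w >> s;
    return w;
}
```

For 8 values, `walshDest(h, 3) + 1` for h = 0..7 gives 1 8 4 5 2 7 3 6, which matches the "Walsh" row in the post.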

Now I can either take the vector in shared memory and store it to global memory with non-coalesced writes all over the place, or permute it in shared memory (lots of bank conflicts) and store it coalesced…

Any thoughts?

MisterAnderson42 · August 14, 2008, 8:03pm

Do the permutation in shared memory and coalesce the writes to global memory. Writing coalesced is just as important as reading. To prove it, I just modified my bw_test code (search the forums if you want it) to read coalesced, but write uncoalesced. For the test that copies floats from one array to another, performance dropped from 56.8 GiB/s bandwidth to 8.6 GiB/s (on G80).
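
[Editor's sketch] The kind of comparison described above could look roughly like the following. This is not the bw_test code, which is not shown in this thread; the kernel names, the `runCopy` helper, and the transposed write index are all assumptions made for illustration. Both kernels read coalesced, and only the second one scatters its writes.

```cuda
#include <cuda_runtime.h>

// Baseline: thread i reads in[i] and writes out[i], so neighbouring
// threads touch neighbouring addresses for both the load and the store.
__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Same coalesced read, but the write index is transposed: neighbouring
// threads store n / width floats apart. Assumes n is a multiple of
// width, which makes the mapping a bijection on [0, n).
__global__ void copyScatteredWrite(const float* in, float* out, int n, int width)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i % width) * (n / width) + (i / width);
        out[j] = in[i];
    }
}

// Hypothetical host helper: copies n floats through one of the kernels.
void runCopy(const float* h_in, float* h_out, int n, bool scattered, int width)
{
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (scattered)
        copyScatteredWrite<<<blocks, threads>>>(d_in, d_out, n, width);
    else
        copyCoalesced<<<blocks, threads>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Timing each kernel over a large array (for example with `cudaEventRecord` and `cudaEventElapsedTime`) should show the gap; the actual numbers depend on the hardware and are not something this sketch predicts.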

The issue isn’t so much latency as it is throughput. Uncoalesced reads (and writes) can’t make use of the entire memory bus width like coalesced ones can, and the write must be split into several transactions. Things are a little different on compute capability 1.2 and later: you can do permutations like this and still coalesce, as long as the permutation stays within windows of a defined byte width (see the 2.0b2 programming guide for the details).

Fuchs · August 14, 2008, 8:26pm

Thanks for the fast reply.

I guess then I already know how to optimize my other kernels. Yay!

Since I’m permuting over a 512-element vector, the new coalescing rules are of no use for this operation.
I’ll load all the vector elements from shared memory into 2 registers per thread (I have twice as many elements as threads) and then write them back (with lots of bank conflicts) into shared memory. Just swapping doesn’t work, so I will use registers, which I’ll have plenty of when I get my GTX 280…
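
[Editor's sketch] The plan above might be sketched as follows: 256 threads per block, 512 elements per block, two registers per thread, a scatter back into shared memory, then a coalesced store. The kernel name, the `VEC_N` and `VEC_LOG2N` macros, and the `walshDest` index function (bit-reversal plus Gray decode) are assumptions for illustration, and the initial load from `in` merely stands in for the transform that would normally leave its result in shared memory.

```cuda
#include <cuda_runtime.h>

#define VEC_N     512   // elements per block, as in the post
#define VEC_LOG2N 9

// Hypothetical index function: where Hadamard element h lands in
// sequency order (assumed: Gray-to-binary decode of the bit-reversed index).
__host__ __device__ inline unsigned walshDest(unsigned h)
{
    unsigned r = 0;
    for (unsigned b = 0; b < VEC_LOG2N; ++b)
        r |= ((h >> b) & 1u) << (VEC_LOG2N - 1 - b);
    unsigned w = r;
    for (unsigned s = 1; s < VEC_LOG2N; s <<= 1)
        w ^= w >> s;
    return w;
}

// Launch with VEC_N / 2 = 256 threads per block, one vector per block.
__global__ void permuteAndStore(const float* in, float* out)
{
    __shared__ float s[VEC_N];
    unsigned t    = threadIdx.x;
    unsigned base = blockIdx.x * VEC_N;

    // Stand-in for the transform: get the block's vector into shared memory.
    s[t]             = in[base + t];
    s[t + VEC_N / 2] = in[base + t + VEC_N / 2];
    __syncthreads();

    // 1. Each thread pulls its two elements into registers.
    float a = s[t];
    float b = s[t + VEC_N / 2];
    __syncthreads();  // all reads finish before any slot is overwritten

    // 2. Scatter into shared memory at the permuted positions.
    //    These writes can hit the same bank from several threads.
    s[walshDest(t)]             = a;
    s[walshDest(t + VEC_N / 2)] = b;
    __syncthreads();

    // 3. Coalesced store: consecutive threads write consecutive addresses.
    out[base + t]             = s[t];
    out[base + t + VEC_N / 2] = s[t + VEC_N / 2];
}
```

The barrier between steps 1 and 2 is what makes the register detour necessary: an in-place swap would let one thread overwrite a slot that another thread has not read yet.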
