Speeding up memory writes

Hi.
Unfortunately my application can't use coalesced memory access patterns. This is not a problem for memory reads (I'm using linear memory with a texture bound to it).
The code is now highly optimized, but almost all of its run time is spent on memory writes (e.g., in a kernel executing in 7 ms, memory writing takes 6 ms).
Does anyone know a trick to speed them up (e.g., something that automatically coalesces writes, similar to the way textures coalesce memory reads [even though they also cache data])?

Thank you.

-Giacomo

How did you time that? If you don't write to memory at the end of your kernel, the compiler removes all the calculation as dead code.

Are the writes at least somewhat localized? A standard trick to coalesce writes is to stage them in a shared memory array. Something like:

__shared__ float buffer[256];

//
// threads write to elements of buffer[] in any order here
//

__syncthreads();

// Write out buffer in coalesced way to global memory
for (int i = threadIdx.x; i < 256; i += blockDim.x)
    g_mem[i] = buffer[i];

Note that I just made up the 256 element buffer. Your problem will be sized differently, of course.
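To make the pattern above concrete, here is a hedged, self-contained sketch of a whole kernel built around that staging idea. The kernel name `stage_and_write`, the output pointer `g_out`, the `BUF_SIZE` constant, and the placeholder computation are all made up for illustration; only the staging/flush structure is the point.

```cuda
#define BUF_SIZE 256

// Hypothetical example: each block fills a shared-memory buffer in any
// order, then flushes it to global memory so the writes coalesce.
__global__ void stage_and_write(float *g_out)
{
    __shared__ float buffer[BUF_SIZE];

    // Phase 1: threads write to buffer[] in whatever order the
    // algorithm needs. (Placeholder computation shown here.)
    buffer[threadIdx.x] = (float)threadIdx.x;

    __syncthreads();  // make sure buffer[] is fully populated

    // Phase 2: flush buffer[] to global memory with consecutive
    // threads touching consecutive addresses, so the writes coalesce.
    float *block_out = g_out + blockIdx.x * BUF_SIZE;
    for (int i = threadIdx.x; i < BUF_SIZE; i += blockDim.x)
        block_out[i] = buffer[i];
}
```

The key design point is that the scattered writes happen in fast shared memory, and only the final, linear flush touches global memory.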

Well, first of all, my read accesses cannot be aligned to the half-warp size: I have a big array (up to a few GB), and every block needs to process some elements starting from almost-random indexes.

Riedjik: what do you mean? Currently every thread does:

//function
for i = blockid… (step is the number of blocks)

  every thread does some computation, loading data from global memory, processing it, and adding to a shared variable (every thread has its own shared variable)

  thread 0 gathers the results (from a shared array of [number of threads] elements)
  thread 0 writes a value to global memory (every block writes one value)

end for
//end function

Coalescing would require restructuring those functions, but doing so would remove some of the benefits I'm getting from this structure and would really slow them down.

The dead-code optimization removes code whose results never get written out to global memory at the end. So when you say your calculation takes 1 ms and writing takes 6 ms, I wonder how you timed the calculation part: it might be that no calculation was left at all in that case.

One way to find out is to add -keep to your nvcc command line and check the generated PTX to see whether any calculation is really going on. If instead you timed it by skipping the calculation but still writing dummy values, your measurement should indeed be accurate.
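As an alternative to deleting instructions (which invites dead-code elimination), the whole kernel can be timed with CUDA events from the runtime API. A minimal sketch; `my_kernel`, `grid`, `block`, and `d_out` are placeholders for your own launch:

```cuda
// Time a kernel with CUDA events instead of modifying its body.
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
my_kernel<<<grid, block>>>(d_out);   // hypothetical launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // wait for the kernel to finish

cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
```

Comparing this full-kernel time against a variant that writes precomputed dummy values gives the write cost without letting the compiler strip the computation.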

You can do that gathering by means of a parallel reduction, which might save you some more time.
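For instance, the "thread 0 gathers the results" step could be replaced by a shared-memory tree reduction, a hedged sketch of which follows. It assumes `blockDim.x` is a power of two; `NUM_THREADS`, `my_partial_result`, and `g_out` are hypothetical names:

```cuda
__shared__ float sdata[NUM_THREADS];  // NUM_THREADS: assumed block size

sdata[threadIdx.x] = my_partial_result;  // each thread's partial value
__syncthreads();

// Halve the number of active threads each step: log2(blockDim.x)
// steps in total, instead of thread 0 looping over every element.
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
        sdata[threadIdx.x] += sdata[threadIdx.x + s];
    __syncthreads();
}

if (threadIdx.x == 0)
    g_out[blockIdx.x] = sdata[0];  // still one value per block
```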

Oh, sorry, I didn't know that.
I had timed the computation part by removing the write instruction.