Speeding up memory writes

Hi.
Unfortunately my application can't use coalesced memory access patterns. This is not a problem for memory reads (I'm using linear memory with a texture bound to it).
The code is now highly optimized, but almost all of its run time is spent on memory writes (e.g., in a kernel executing in 7 ms, memory writing takes 6 ms).
Does anyone know a trick to speed them up (e.g., something that automatically coalesces writes, similar to the way textures coalesce memory reads [even though they also cache data])?

Thank you.

-Giacomo

How did you time that? If you don't write to memory at the end of your kernel, the compiler removes all the calculation as dead code.

Are the writes at least somewhat localized? A standard trick to coalesce writes is to stage them in a shared memory array. Something like:

__shared__ float buffer[256];

//
// threads write to elements of buffer[] in any order here
//

__syncthreads();

// Write out buffer in coalesced way to global memory
for (int i = threadIdx.x; i < 256; i += blockDim.x)
    g_mem[i] = buffer[i];

Note that I just made up the 256 element buffer. Your problem will be sized differently, of course.
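To make the pattern above concrete, here is a hedged, self-contained sketch of a whole kernel built around that staging idea. The kernel name `stage_and_write`, the output pointer `g_out`, the `BUF_SIZE` constant, and the placeholder computation are all made up for illustration; only the staging/flush structure is the point.

```cuda
#define BUF_SIZE 256

// Hypothetical example: each block fills a shared-memory buffer in any
// order, then flushes it to global memory so the writes coalesce.
__global__ void stage_and_write(float *g_out)
{
    __shared__ float buffer[BUF_SIZE];

    // Phase 1: threads write to buffer[] in whatever order the
    // algorithm needs. (Placeholder computation shown here.)
    buffer[threadIdx.x] = (float)threadIdx.x;

    __syncthreads();  // make sure buffer[] is fully populated

    // Phase 2: flush buffer[] to global memory with consecutive
    // threads touching consecutive addresses, so the writes coalesce.
    float *block_out = g_out + blockIdx.x * BUF_SIZE;
    for (int i = threadIdx.x; i < BUF_SIZE; i += blockDim.x)
        block_out[i] = buffer[i];
}
```

The key design point is that the scattered writes happen in fast shared memory, and only the final, linear flush touches global memory.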

Well, first of all, my read accesses cannot be aligned to the half-warp size: I have a big array (up to a few GB), and every block needs to process some elements starting from almost-random indexes.

Riedjik: what do you mean? Currently every thread does:

//function
for i = blockid… (step is the number of blocks)

  every thread does some computation, loading data from global memory, processing it, and adding to a shared variable (every thread has its own shared variable)

  thread 0 gathers the results (from a shared array of [number of threads] elements)
  thread 0 writes a value to global memory (every block writes one value)

end for
//end function

Coalescing would require restructuring those functions, but doing so would remove some of the benefits I'm getting from this structure and would really slow them down.

The dead-code optimization removes code whose results never get written out to global memory at the end. So when you say your calculation takes 1 ms and writing takes 6 ms, I wonder how you timed the calculation part: it might be that no calculation was left at all in that case.

One way to find out is to add -keep to your nvcc command line and check the generated PTX to see whether any calculation is really going on. If instead you timed it by skipping the calculation but still writing dummy values, your measurement should indeed be accurate.
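As an alternative to deleting instructions (which invites dead-code elimination), the whole kernel can be timed with CUDA events from the runtime API. A minimal sketch; `my_kernel`, `grid`, `block`, and `d_out` are placeholders for your own launch:

```cuda
// Time a kernel with CUDA events instead of modifying its body.
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
my_kernel<<<grid, block>>>(d_out);   // hypothetical launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // wait for the kernel to finish

cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
```

Comparing this full-kernel time against a variant that writes precomputed dummy values gives the write cost without letting the compiler strip the computation.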

You can do that gathering by means of a parallel reduction, which might save you some more time.
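For instance, the "thread 0 gathers the results" step could be replaced by a shared-memory tree reduction, a hedged sketch of which follows. It assumes `blockDim.x` is a power of two; `NUM_THREADS`, `my_partial_result`, and `g_out` are hypothetical names:

```cuda
__shared__ float sdata[NUM_THREADS];  // NUM_THREADS: assumed block size

sdata[threadIdx.x] = my_partial_result;  // each thread's partial value
__syncthreads();

// Halve the number of active threads each step: log2(blockDim.x)
// steps in total, instead of thread 0 looping over every element.
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
        sdata[threadIdx.x] += sdata[threadIdx.x + s];
    __syncthreads();
}

if (threadIdx.x == 0)
    g_out[blockIdx.x] = sdata[0];  // still one value per block
```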

Oh, sorry, I didn't know that.
I had timed the computation part by removing the write instruction.