Why is my global memory write so slow?

I’m using a GTX 260. My kernel looks like this:

__global__ void k(float *x0, float *y0, float *z0, ... complex32 *dataout){
	complex32 c[169];
	int f = blockIdx.x * blockDim.x + threadIdx.x;

	//perform work to write data to "c"

	for(int i=0; i<169; i++){
		dataout[(169*f)+i].r = c[i].r;
		dataout[(169*f)+i].i = c[i].i;
	}
}

The complex32 struct is as follows:

typedef struct{
	float r;
	float i;
} complex32;

The kernel “k” is launched with 512 threads in total, so f ranges from 0 to 511. Threads per block is set to 16, as that seems to be the fastest for my kernel. Some of the 512 threads will finish working and get to the global memory write stage before others do. Without the write, the kernel takes 185 ms to finish. With the write included, it takes around 1680 ms when compiled with -arch sm_11. (For whatever reason it takes around 6000 ms with -arch sm_13.) I only need to write 169 * 512 * 8 = 692,224 bytes to global memory, so why does it take a second and a half to do so? Or, what other data/memory structure can I use to speed it up? Even if I add “__align__(8)” to my struct, or use the float2 datatype, the speed doesn’t change.

Someone please correct me if I’m wrong, but I believe that the complex array ‘c’ is stored in “local” memory, which is somewhat of a misnomer, because “local” memory actually resides in the same off-chip device memory as global memory. So by eliminating the writes, you are actually eliminating two global writes and two global reads per loop iteration.

Also, if the compiler sees you are not writing to global memory, it may optimize out a lot of your code, since it sees that your kernel isn’t really doing anything.
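One common way to time the computation alone without the compiler discarding it is to make the write depend on the computed value, so the work has to be done even when nothing is stored. Below is a minimal sketch of that idea, not the original kernel: the arithmetic is a stand-in, and the names work_guarded, time_kernel and threshold are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the real per-thread work (hypothetical arithmetic).
// The store depends on acc, so the loop cannot be removed as dead code.
__global__ void work_guarded(float *out, int n, float threshold)
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= n) return;

    float acc = 0.0f;
    for (int i = 0; i < 169; i++)
        acc += sinf(0.001f * f + i);   // |acc| can never exceed 169

    if (acc > threshold)               // 1e30f: never stores, -1e30f: always stores
        out[f] = acc;
}

// Launches the kernel once and returns the elapsed time in ms (negative on error).
float time_kernel(int n, float threshold)
{
    float *d_out = NULL;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    work_guarded<<<(n + 63) / 64, 64>>>(d_out, n, threshold);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return ms;
}

int main()
{
    time_kernel(512, 1e30f);   // warm-up launch, result ignored
    printf("work only:    %.3f ms\n", time_kernel(512, 1e30f));
    printf("work + write: %.3f ms\n", time_kernel(512, -1e30f));
    return 0;
}
```

The difference between the two timings is then a fairer estimate of what the store itself costs than comparing against a kernel with the write deleted.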

The reads/writes probably aren’t coalesced either, since threads are concurrently accessing every other float (all the real parts, then all the imaginary parts). If possible, try doubling the number of threads and have each thread only deal with one float (either the real or imaginary part). Then you should be able to get the threads to access memory sequentially, which will coalesce the reads/writes.
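As a rough illustration of what a coalesced layout could look like: in the original kernel, thread f writes to dataout[169*f + i], so neighbouring threads are 169 elements apart on every store. One alternative to splitting real and imaginary parts across threads is to transpose the output so that element i of thread f lives at index i*n_threads + f, and to store each complex value as a single float2. This is only a sketch under those assumptions. The kernel name, the placeholder values and the block size of 64 are made up, and the host code reading the results would have to use the same transposed indexing.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N_ELEMS 169   // complex values per thread, as in the post

// Transposed layout: element i of thread f is stored at out[i * n_threads + f],
// so consecutive threads write consecutive 8-byte float2 values.
__global__ void write_transposed(float2 *out, int n_threads)
{
    int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= n_threads) return;

    for (int i = 0; i < N_ELEMS; i++) {
        // Placeholder for the real computation (hypothetical values).
        float2 v = make_float2((float)f, (float)i);
        out[i * n_threads + f] = v;   // one 8-byte store per thread
    }
}

// Runs the kernel and checks the layout on the host.
// Returns the number of mismatching elements, or -1 on an allocation/CUDA error.
int count_layout_errors(int n_threads)
{
    size_t count = (size_t)n_threads * N_ELEMS;
    float2 *h_out = (float2 *)malloc(count * sizeof(float2));
    float2 *d_out = NULL;
    if (h_out == NULL || cudaMalloc(&d_out, count * sizeof(float2)) != cudaSuccess) {
        free(h_out);
        return -1;
    }

    write_transposed<<<(n_threads + 63) / 64, 64>>>(d_out, n_threads);
    cudaError_t err = cudaMemcpy(h_out, d_out, count * sizeof(float2),
                                 cudaMemcpyDeviceToHost);

    int errors = (err == cudaSuccess) ? 0 : -1;
    for (int f = 0; errors >= 0 && f < n_threads; f++) {
        for (int i = 0; i < N_ELEMS; i++) {
            float2 v = h_out[i * n_threads + f];
            if (v.x != (float)f || v.y != (float)i) errors++;
        }
    }

    cudaFree(d_out);
    free(h_out);
    return errors;
}

int main()
{
    printf("layout errors: %d\n", count_layout_errors(512));
    return 0;
}
```

Writing each value straight to the output as it is computed, where the algorithm allows it, would also avoid keeping the 169-element array in local memory.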

Yes, you’re right, it is optimizing out a lot of the calculations that aren’t being written to global memory. If for example I create a new integer vector, size 512, and I have something like:
y[f] = (int)c[i].x;
That is only one 4-byte integer write per thread, and it still takes 1700+ ms.
I’ll need to optimize my kernel’s local memory calculations to make it faster.