coalescing struct loading problem

I am a little ashamed of myself, but I noticed that in my previous benchmarks the memcpy was included in the measurement :">

So here is another benchmark; this time all timing information comes from the CUDA profiler and covers the copy kernel only.

__global__ void offsetCopy(float *odata, float *idata, int offset_read, int offset_write)
{
	// Each thread copies one float; the two offsets shift the start of the
	// loads and the stores independently.
	int xid_read  = blockIdx.x * blockDim.x + threadIdx.x + offset_read;
	int xid_write = blockIdx.x * blockDim.x + threadIdx.x + offset_write;
	odata[xid_write] = idata[xid_read];
}

__global__ void offsetCopy4(float4 *odata, float4 *idata, int offset_read, int offset_write)
{
	// Same copy, but each thread moves a 16-byte float4 instead of a 4-byte float.
	int xid_read  = blockIdx.x * blockDim.x + threadIdx.x + offset_read;
	int xid_write = blockIdx.x * blockDim.x + threadIdx.x + offset_write;
	odata[xid_write] = idata[xid_read];
}

By using offsets that are not a multiple of 16, the loads/stores become uncoalesced on a compute capability 1.1 device (such as mine).
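For completeness, here is a minimal host-side sketch of how one of these runs could be driven. The block size, the buffer padding and the event-based timing are just one possible setup (assumptions on my part, not the exact harness I used); the figures below were taken from the CUDA profiler.

#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as above, repeated so this sketch compiles on its own.
__global__ void offsetCopy(float *odata, float *idata, int offset_read, int offset_write)
{
	int xid_read  = blockIdx.x * blockDim.x + threadIdx.x + offset_read;
	int xid_write = blockIdx.x * blockDim.x + threadIdx.x + offset_write;
	odata[xid_write] = idata[xid_read];
}

int main()
{
	const int n       = 1280000;  // total number of threads, as in the results below
	const int threads = 256;      // block size: an assumption
	const int blocks  = n / threads;
	const int pad     = 32;       // slack so a small offset stays inside the buffers

	float *idata, *odata;
	cudaMalloc((void**)&idata, (n + pad) * sizeof(float));
	cudaMalloc((void**)&odata, (n + pad) * sizeof(float));

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	// offset_read = 1 breaks coalescing of the loads on a compute 1.1 device,
	// offset_write = 0 keeps the stores coalesced.
	cudaEventRecord(start);
	offsetCopy<<<blocks, threads>>>(odata, idata, 1, 0);
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);
	printf("kernel time: %.3f ms\n", ms);

	cudaFree(idata);
	cudaFree(odata);
	return 0;
}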

The results are the GPU times of the kernel alone, run with 1,280,000 threads on a GeForce 8800 GT.

coalesced
float  0.210 ms (48 GB/s)
float4 0.900 ms (45 GB/s)

read uncoalesced
float  1.300 ms (8 GB/s)
float4 1.500 ms (27 GB/s)

write uncoalesced
float  0.950 ms (11 GB/s)
float4 1.350 ms (30 GB/s)

read+write uncoalesced
float  1.850 ms (5 GB/s)
float4 1.900 ms (21 GB/s)
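For reference, the bandwidth figures in parentheses follow directly from the timings: every thread reads one element and writes one element, so each kernel moves 2 * N * sizeof(element) bytes in total. A quick sketch of that arithmetic for the two coalesced rows (it gives roughly 48.8 and 45.5 GB/s, which rounds to the values above):

#include <stdio.h>

int main(void)
{
	const double n = 1280000.0;   // one element read and one written per thread

	// coalesced float copy: 0.210 ms, 4 bytes per element
	double gbps_f  = 2.0 * n * 4.0  / 0.210e-3 / 1e9;
	// coalesced float4 copy: 0.900 ms, 16 bytes per element
	double gbps_f4 = 2.0 * n * 16.0 / 0.900e-3 / 1e9;

	printf("float : %.1f GB/s\n", gbps_f);   // ~48.8
	printf("float4: %.1f GB/s\n", gbps_f4);  // ~45.5
	return 0;
}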

I found this relevant explanation in one of the CUDA documents:

This also explains why, for the 'read+write uncoalesced' case, it makes no difference whether I use float or float4. What these benchmarks show is that not only coalesced reads but also coalesced writes are VERY important for the total memory throughput.

It also shows that as long as your accesses are coalesced, it does not matter whether you use the float or the float4 type. Good to know! :)