I am a little ashamed of myself, but I noticed that in my previous benchmarks the memcpy was included in the measurement :">
So here is another benchmark; this time all timing information is taken from the CUDA profiler, and for the copy kernel only.
__global__ void offsetCopy(float *odata, float *idata, int offset_read, int offset_write)
{
    int xid_read  = blockIdx.x * blockDim.x + threadIdx.x + offset_read;
    int xid_write = blockIdx.x * blockDim.x + threadIdx.x + offset_write;
    odata[xid_write] = idata[xid_read];
}
__global__ void offsetCopy4(float4 *odata, float4 *idata, int offset_read, int offset_write)
{
    int xid_read  = blockIdx.x * blockDim.x + threadIdx.x + offset_read;
    int xid_write = blockIdx.x * blockDim.x + threadIdx.x + offset_write;
    odata[xid_write] = idata[xid_read];
}
By using offsets that are not a multiple of 16, the loads/stores become uncoalesced on a compute capability 1.1 device (such as mine).
Results are GPU times of the kernel, run with 1,280,000 threads on a GeForce 8800 GT.
coalesced
  float   0.210 ms  (48 GB/s)
  float4  0.900 ms  (45 GB/s)
uncoalesced read
  float   1.300 ms  ( 8 GB/s)
  float4  1.500 ms  (27 GB/s)
uncoalesced write
  float   0.950 ms  (11 GB/s)
  float4  1.350 ms  (30 GB/s)
uncoalesced read+write
  float   1.850 ms  ( 5 GB/s)
  float4  1.900 ms  (21 GB/s)
I found a relevant explanation in one of the CUDA documents:
It also explains why, for "read+write uncoalesced", it makes almost no difference whether I use float or float4. What these benchmarks show is that not only coalesced reads but also coalesced writes are VERY important for the total memory throughput.