Why is a 128-byte transaction slower than a 64-byte one?

Hello.

I wrote two CUDA kernels that simply copy data from global memory to global memory.

One copies 4 bytes per thread, so each access by 16 threads (a half-warp) is coalesced into one 64-byte transaction (4 × 16).

The other copies 8 bytes per thread, so each access by 16 threads (a half-warp) is coalesced into one 128-byte transaction (8 × 16).

Both copy the same amount of data, so the second kernel (8 bytes per thread) uses half as many threads as the first (4 bytes per thread).

By the same token, the second kernel issues half as many transactions as the first.
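The counts behind this reasoning can be checked with a few lines of host code. This is only a sketch: it assumes the 64 MB copy size used in my code, perfectly coalesced accesses, and one transaction per half-warp in each direction (load or store). The helper names `threadsNeeded` and `transactionsNeeded` are made up for illustration.

```cuda
#include <cstdio>
#include <cstddef>

// Hypothetical helper: threads needed to copy `bytes` at `bytesPerThread` each.
static size_t threadsNeeded(size_t bytes, size_t bytesPerThread)
{
    return bytes / bytesPerThread;
}

// Hypothetical helper: transactions per direction (load or store),
// assuming one coalesced transaction per half-warp of 16 threads.
static size_t transactionsNeeded(size_t bytes, size_t bytesPerThread)
{
    const size_t halfWarp = 16;
    return threadsNeeded(bytes, bytesPerThread) / halfWarp;
}

int main()
{
    const size_t bytes = 64u << 20;  // 64 MB

    // 4 bytes per thread: 16777216 threads, 1048576 transactions of 64 bytes
    printf("4-byte kernel: %zu threads, %zu transactions\n",
           threadsNeeded(bytes, 4), transactionsNeeded(bytes, 4));

    // 8 bytes per thread: 8388608 threads, 524288 transactions of 128 bytes
    printf("8-byte kernel: %zu threads, %zu transactions\n",
           threadsNeeded(bytes, 8), transactionsNeeded(bytes, 8));
    return 0;
}
```

Either way the total number of bytes moved is the same (1048576 × 64 = 524288 × 128 = 64 MB); only the transaction count differs.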

I expected the second to be much faster than the first, because it performs far fewer transactions.

In fact, however, the second takes slightly more time than the first.

Can anyone explain why?

I am using CUDA 2.3 on a GTX 295.

My code (abridged) is below.

Thanks in advance.

[codebox]
// Kernel copies 4 bytes per thread.
// Note: unsigned long is 4 bytes on Windows and 32-bit Linux, but 8 bytes on 64-bit Linux.
__global__ void kernelCopy4(unsigned long* src, unsigned long* dst)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];
}

// Kernel copies 8 bytes per thread.
__global__ void kernelCopy8(unsigned long long int* src, unsigned long long int* dst)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];
}

int main(int argc, char** argv)
{
    // (Abridged: allocation of the device buffers src and dst is omitted.)

    dim3 dimBlock1(512);
    dim3 dimGrid1(32768);
    // Copies 64 MB (512 × 32768 × 4)
    kernelCopy4<<<dimGrid1, dimBlock1>>>((unsigned long*)src, (unsigned long*)dst);

    dim3 dimBlock2(512);
    dim3 dimGrid2(32768 / 2);
    // Copies 64 MB (512 × 16384 × 8)
    kernelCopy8<<<dimGrid2, dimBlock2>>>((unsigned long long int*)src, (unsigned long long int*)dst);
}
[/codebox]
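For anyone who wants to reproduce the comparison, the two kernels could be timed with CUDA events along the following lines. This is a minimal sketch, not my original benchmark: the names `copy4`, `copy8` and `effectiveGBps` are made up for illustration, `unsigned int` is used as the 4-byte type so the element size does not depend on the platform, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// 4 bytes per thread.
__global__ void copy4(const unsigned int* src, unsigned int* dst)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];
}

// 8 bytes per thread.
__global__ void copy8(const unsigned long long* src, unsigned long long* dst)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];
}

// Effective bandwidth in GB/s: bytes read plus bytes written, divided by elapsed time.
static double effectiveGBps(size_t bytesCopied, float ms)
{
    return 2.0 * bytesCopied / (ms * 1.0e-3) / 1.0e9;
}

int main()
{
    const size_t bytes = 64u << 20;  // 64 MB, as in the post
    const unsigned int threads = 512;
    const unsigned int grid4 = (unsigned int)(bytes / 4 / threads);  // 32768 blocks
    const unsigned int grid8 = (unsigned int)(bytes / 8 / threads);  // 16384 blocks

    void *src = 0, *dst = 0;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);
    cudaMemset(src, 1, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms4 = 0.0f, ms8 = 0.0f;

    // Warm-up launch so one-time startup cost is not charged to the first timed kernel.
    copy4<<<grid4, threads>>>((const unsigned int*)src, (unsigned int*)dst);

    // Time the 4-byte kernel.
    cudaEventRecord(start);
    copy4<<<grid4, threads>>>((const unsigned int*)src, (unsigned int*)dst);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms4, start, stop);

    // Time the 8-byte kernel.
    cudaEventRecord(start);
    copy8<<<grid8, threads>>>((const unsigned long long*)src, (unsigned long long*)dst);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms8, start, stop);

    printf("copy4: %.3f ms, %.1f GB/s\n", ms4, effectiveGBps(bytes, ms4));
    printf("copy8: %.3f ms, %.1f GB/s\n", ms8, effectiveGBps(bytes, ms8));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

Reporting effective bandwidth rather than raw time makes it easier to see how far each kernel is from the card's peak memory bandwidth. Averaging over several launches would give more stable numbers than a single timed run.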