Why is a 128-byte transaction slower than a 64-byte one?

sarutake.nv · December 4, 2009, 4:18am

Hello.

I wrote two CUDA kernels that simply copy data from global memory to global memory.

The first copies 4 bytes per thread, so the accesses of 16 threads (a half-warp) coalesce into one 64-byte transaction (4 × 16).

The second copies 8 bytes per thread, so the accesses of 16 threads (a half-warp) coalesce into one 128-byte transaction (8 × 16).

The total amount of data copied is the same in both cases, so the second kernel (8 bytes per thread) launches half as many threads as the first (4 bytes per thread).

By the same token, the second kernel issues half as many transactions as the first.
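
To make the counts concrete, here is a small host-side sketch of the arithmetic. It assumes the model described above (one transaction per half-warp of aligned, contiguous accesses); `copy_stats`, `CopyStats` and `kHalfWarp` are hypothetical names used only for illustration and are not part of my program.

```cpp
#include <cassert>
#include <cstdint>

// Assumption taken from the post: one half-warp (16 threads) of aligned,
// contiguous accesses coalesces into a single transaction of
// 16 * bytes_per_thread bytes.
constexpr std::uint64_t kHalfWarp = 16;

struct CopyStats {
    std::uint64_t threads;                // total threads launched
    std::uint64_t transactions;           // load transactions (stores have the same count)
    std::uint64_t bytes_per_transaction;  // size of each coalesced transaction
};

// Hypothetical helper: derives thread and transaction counts for a copy of
// total_bytes where each thread moves bytes_per_thread bytes.
inline CopyStats copy_stats(std::uint64_t total_bytes, std::uint64_t bytes_per_thread) {
    // The model only applies when the copy divides evenly into half-warps.
    assert(total_bytes % (kHalfWarp * bytes_per_thread) == 0);
    CopyStats s;
    s.threads = total_bytes / bytes_per_thread;
    s.transactions = s.threads / kHalfWarp;
    s.bytes_per_transaction = kHalfWarp * bytes_per_thread;
    return s;
}
```

For the 64 MB copy, `copy_stats(67108864, 4)` gives 16,777,216 threads and 1,048,576 transactions of 64 bytes, while `copy_stats(67108864, 8)` gives 8,388,608 threads and 524,288 transactions of 128 bytes. In both cases the transaction count times the transaction size is the same 64 MB, so only the number of transactions changes, not the number of bytes moved.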

I expected the second kernel to be much faster than the first, because it performs far fewer transactions.

In fact, however, the second takes slightly more time than the first.

Can anyone explain why?

I am using CUDA 2.3 on a GTX 295.

My code (abridged) is below.

Thanks in advance.

[codebox]
// Kernel that copies 4 bytes per thread.
// Note: unsigned long is 4 bytes only on 32-bit and Windows builds.
__global__ void kernelCopy4(unsigned long* src, unsigned long* dst)
{
    dst[blockIdx.x * blockDim.x + threadIdx.x] = src[blockIdx.x * blockDim.x + threadIdx.x];
}

// Kernel that copies 8 bytes per thread.
__global__ void kernelCopy8(unsigned long long int* src, unsigned long long int* dst)
{
    dst[blockIdx.x * blockDim.x + threadIdx.x] = src[blockIdx.x * blockDim.x + threadIdx.x];
}

int main(int argc, char** argv)
{
    // Allocation was omitted from the abridged listing;
    // two 64 MB device buffers are assumed here.
    const size_t numBytes = 512 * 32768 * 4;
    void* src;
    void* dst;
    cudaMalloc(&src, numBytes);
    cudaMalloc(&dst, numBytes);

    dim3 dimBlock1(512);
    dim3 dimGrid1(32768);
    // Copies 64 MB (512 × 32768 × 4)
    kernelCopy4<<<dimGrid1, dimBlock1>>>((unsigned long*)src, (unsigned long*)dst);

    dim3 dimBlock2(512);
    dim3 dimGrid2(32768 / 2);
    // Copies 64 MB (512 × 16384 × 8)
    kernelCopy8<<<dimGrid2, dimBlock2>>>((unsigned long long int*)src, (unsigned long long int*)dst);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}
[/codebox]
