Illogical bandwidth results for global memory reads and writes


I ran some bandwidth tests on global memory.

There are 3 different tests:

  • read only

  • write only

  • read and write

I expected the “read and write” result to be the average of “read only” and “write only”. It turned out differently: “read and write” was much faster than either “read only” or “write only”. For the “read and write” test I multiplied the bandwidth by 2, because each element is transferred twice (one read and one write). The data type in all measurements is float.

The kernels:

//copy data from global to global memory (global to global or "read and write")

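template <class T> __global__ void copy_gmem(T* g_idata, T* g_odata, T c)
{
	// one element per thread: global index of this thread
	const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

	// one global load and one global store per thread
	g_odata[idx] = g_idata[idx];
}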

//copy data from global to shared memory (read only)

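template <class T> __global__ void read_only_gmem(T* g_idata, T* g_odata, T c)
{
	// one element per thread: global index of this thread
	const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

	__shared__ T shared[BLOCK_SIZE];

	// one global load per thread, stored to shared memory
	shared[threadIdx.x] = g_idata[idx];

	// increment the neighbouring shared element (the index wraps around if BLOCK_SIZE is a power of two)
	*((float *)(&shared[(threadIdx.x + 1) & (BLOCK_SIZE-1)])) += 1.0;
}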

//writes a constant to the global memory (write only)

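template <class T> __global__ void write_only(T* g_idata, T* g_odata, T c)
{
	// one element per thread: global index of this thread
	const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

	// one global store per thread, no global load
	g_odata[idx] = c;
}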

The results are in the attachment.


What is the reason for the higher bandwidth in the “read and write” test? Why is it not the average of “read only” and “write only”?

Greetings, lanzelot