Are global memory reads and writes counted differently? The profiler result is confusing.

I have a really simple kernel that just copies the data from one array to another:

[codebox]__global__ void F(float2 *data1, float2 *data2)
{
    // One thread per float2 element.
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    float2 v = data1[j];
    data2[j] = v;
}
[/codebox]

When I inspect the profiler output, it reports 8192 for gld coalesced but 65536 for gst coalesced. I thought these two numbers should be the same. Does anyone know the reason for this? Thanks.

BTW, I am using a 9600GT and CUDA 2.3.

Does anyone know the reason for this? Thanks.

I have no idea why there would be a different number of loads than stores.

One thing I can note, however, is that it’s probably better to treat the arrays as float rather than float2, so that threads in a block operate on consecutive words in memory rather than consecutive pairs of words. Note that you then need to launch twice as many threads to cover the same data. For example:

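[codebox]__global__ void F(float2 *data1, float2 *data2)
{
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    // Reinterpret the float2 arrays as plain float arrays: one float per thread.
    float v = ((float*)data1)[j];
    ((float*)data2)[j] = v;
}
[/codebox]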

Or even

[codebox]__global__ void F(float *data1, float *data2)
{
    // One thread per float element.
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    float v = data1[j];
    data2[j] = v;
}
[/codebox]

Hi Keldor314,

I tried your method of accessing the memory as floats, but the high number of stores is still there.

Totally at a loss.

What block and grid sizes are you using?
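For reference when comparing counter values, here is a minimal sketch of a complete program that launches the copy kernel with an explicit block and grid size. The element count `n` and the block size of 256 are hypothetical values chosen for illustration (the thread does not state the ones actually used), and the extra `n` parameter with its bounds check is an addition to the original kernel. The program uses the current runtime API rather than the CUDA 2.3 one.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Copy kernel from the thread, plus a bounds check (an addition, not in the
// original) so that a grid covering more than n elements is still safe.
__global__ void F(const float2 *data1, float2 *data2, int n)
{
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    if (j < n) {
        data2[j] = data1[j];
    }
}

int main()
{
    const int n = 1 << 20;            // hypothetical element count
    const int threadsPerBlock = 256;  // hypothetical block size
    // Round up so that every element gets a thread.
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    const size_t bytes = n * sizeof(float2);

    // Fill the input with values that are exactly representable as float.
    std::vector<float2> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) {
        h_in[i] = make_float2((float)i, -(float)i);
    }

    float2 *d_in = NULL, *d_out = NULL;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    F<<<blocks, threadsPerBlock>>>(d_in, d_out, n);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // This cudaMemcpy blocks until the kernel has finished.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    // Verify the copy element by element.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i].x != h_in[i].x || h_out[i].y != h_in[i].y) {
            ++mismatches;
        }
    }
    printf("blocks=%d threadsPerBlock=%d mismatches=%d\n",
           blocks, threadsPerBlock, mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

Posting the values of `blocks` and `threadsPerBlock` together with the profiler counters makes it easier for others to reproduce the numbers on different cards.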

On a GTX275 I get equal numbers of gst and gld.
On an 8800 GT I get 4 times as many gst as gld (which is what you are describing).

I would expect the gld and gst numbers to be the same. I can’t understand why it would need more writes than reads.