I have a really simple kernel which just copies data from one array to another:
[codebox]__global__ void F(float2 *data1, float2 *data2)
{
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    float2 v = data1[j];   // one float2 load per thread
    data2[j] = v;          // one float2 store per thread
}
[/codebox]
When I inspect the profiler output, it reports 8192 gld coalesced but 65536 gst coalesced… I thought these two numbers should be the same. Does anyone know the reason for this? Thanks.
BTW, I am using a 9600GT and CUDA 2.3.
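In case it helps, the launch is essentially like this (just a sketch; N and the block size of 256 are placeholders, not necessarily my exact values):
[codebox]// Host-side sketch of the launch (illustrative sizes only)
int N = 1 << 20;                          // number of float2 elements (placeholder)
size_t bytes = N * sizeof(float2);

float2 *d_data1, *d_data2;
cudaMalloc((void**)&d_data1, bytes);
cudaMalloc((void**)&d_data2, bytes);

int threads = 256;                        // block size (placeholder)
int blocks = (N + threads - 1) / threads;
F<<<blocks, threads>>>(d_data1, d_data2);
cudaThreadSynchronize();                  // CUDA 2.3-era synchronization call
[/codebox]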
I have no idea why there would be a different number of loads than stores.
One thing I can note, however, is that it's probably better to cast the arrays to float rather than float2, so that threads in a block operate on consecutive words in memory rather than consecutive pairs of words. For example:
[codebox]__global__ void F(float2 *data1, float2 *data2)
{
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    float v = ((float*)data1)[j];
    ((float*)data2)[j] = v;
}
[/codebox]
Or even
[codebox]__global__ void F(float *data1, float *data2)
{
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    float v = data1[j];
    data2[j] = v;
}
[/codebox]
Hi Keldor314,
I have tried your method of writing the memory using floats, but the high number of stores is still there…
Totally at a loss.
What block and grid size parameters do you use?
On a GTX 275 I get an equal number of gst and gld.
On an 8800 GT I get 4 times more gst than gld (which is what you are describing).
I guess the gld and gst numbers should be the same… I can't understand why it needs more writes than reads…