I wanted to optimize my application, so I used the profiler to get some information, especially about memory access.
I saw that there were many uncoalesced gld, so I read the Programming Guide and many other tutorials.
They all said that a memory access is coalesced if
- each thread accesses 4B/8B/16B
- the memory is aligned so that the next thread accesses the next memory block (so there is no space between the blocks)
- the first thread accesses an address that is a multiple of 64B (N*64B)
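To make the three rules above concrete, here is a minimal host-side sketch that tests them for a group of consecutive threads. It is only an illustration of the rules as summarized in this post, not the hardware's exact algorithm. The names `stridedAddresses` and `meetsListedRules` are hypothetical, and the group size of 16 threads (a half-warp) used in the example is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: byte address touched by each of n consecutive
// threads when thread k accesses base + k * strideBytes.
std::vector<std::uint64_t> stridedAddresses(std::uint64_t base,
                                            std::size_t strideBytes,
                                            int n) {
    std::vector<std::uint64_t> addr;
    for (int k = 0; k < n; ++k)
        addr.push_back(base + static_cast<std::uint64_t>(k) * strideBytes);
    return addr;
}

// Hypothetical checker for the three rules exactly as listed in the post.
bool meetsListedRules(const std::vector<std::uint64_t>& addr,
                      std::size_t accessBytes) {
    // Rule 1: each thread accesses 4, 8, or 16 bytes.
    if (accessBytes != 4 && accessBytes != 8 && accessBytes != 16)
        return false;
    // Rule 3: the first thread's address is a multiple of 64 bytes.
    if (addr.empty() || addr[0] % 64 != 0)
        return false;
    // Rule 2: each thread starts exactly where the previous one ended (no gaps).
    for (std::size_t k = 1; k < addr.size(); ++k)
        if (addr[k] != addr[k - 1] + accessBytes)
            return false;
    return true;
}
```

For example, `meetsListedRules(stridedAddresses(0, 4, 16), 4)` is true, while the same 4-byte access with an 8-byte stride, `meetsListedRules(stridedAddresses(0, 8, 16), 4)`, is false because rule 2 fails.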
So could anybody tell me what is wrong with my code (or my mind)?
struct testStruct
{
    int x;
    int y;
    ON_CUDA testStruct(){}
    ON_CUDA testStruct(int x, int y = 0, int z = 0) : x(x), y(y){}
};

__global__ static void testScale(testStruct *ret0) {
    const int bid = blockIdx.x;
    const int tid = threadIdx.x;
    testStruct ts(ret0[bid * blockDim.x + tid]);
    int ret = tid * bid + tid - bid + ts.x;
    for (int i = 1; i < 1; i++) {
        for (int j = i; j < 1; j++) {
            ret += i;
            ret *= j;
            ret %= 99991;
            ret++;
            ret <<= 1;
        }
    }
    testStruct retS(ret, 0);
    ret0[bid * blockDim.x + tid] = retS;
}
The profiler says:
gld uncoalesced: 26880
gst uncoalesced: 107520
with the launch testScale<<<105, 256>>>
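The counters above can be put in proportion to the launch size with a little arithmetic. The sketch below only does that division. The helper names `totalThreads` and `perThread` are hypothetical, and how the profiler actually attributes its counts to threads is not established in this post.

```cpp
#include <cassert>

// Hypothetical helper: number of threads started by kernel<<<blocks, threadsPerBlock>>>.
long totalThreads(long blocks, long threadsPerBlock) {
    return blocks * threadsPerBlock;
}

// Hypothetical helper: a profiler counter divided by the number of launched threads.
double perThread(long counter, long threads) {
    return static_cast<double>(counter) / static_cast<double>(threads);
}
```

With `totalThreads(105, 256)` equal to 26880, the gld counter works out to exactly 1 per launched thread and the gst counter to exactly 4 per launched thread.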
I think the alignment is not a problem, and the data size seems to be okay at 2 x 4 bytes.
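Size and alignment are two separate properties, and both can be checked on the host. The CUDA Programming Guide says that a global memory access compiles to a single instruction only when the data size is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned. Below is a minimal sketch of that check, assuming a C++11 compiler and that the host compiler lays the struct out the same way as the device compiler. `PlainPair`, `AlignedPair`, and `singleAccessPossible` are hypothetical names, and the constructors from the post are omitted.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the struct in the post: two ints, no alignment request.
struct PlainPair {
    int x;
    int y;
};

// Hypothetical variant that requests 8-byte alignment.
// In CUDA source the corresponding qualifier is __align__(8).
struct alignas(8) AlignedPair {
    int x;
    int y;
};

// True if T has one of the supported access sizes and its alignment
// is at least its size (natural alignment).
template <typename T>
bool singleAccessPossible() {
    const std::size_t size = sizeof(T);
    const bool sizeOk =
        (size == 1 || size == 2 || size == 4 || size == 8 || size == 16);
    return sizeOk && alignof(T) >= size;
}
```

On a typical platform with 4-byte `int`, `singleAccessPossible<PlainPair>()` is false (size 8, alignment 4) and `singleAccessPossible<AlignedPair>()` is true. Whether this is what the profiler is reporting here is not confirmed. With the CUDA 2.3 toolchain, `__align__(8)` would be the qualifier to try, since `alignas` requires C++11.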
I am using 32-bit Windows XP on an 8600M GS with CUDA 2.3.