I just had a closer look at the matrix multiplication sample (oclMatrixMul) in Nvidia's SDK. Now I'm wondering how data is written from global to local memory, because it is written like this:
As[local_id_x * BLOCK_SIZE + local_id_y]
Correct me if I'm wrong, but in this case we are writing in an uncoalesced way. I read somewhere that this doesn't matter for shared memory, but if I change the code above to:
As[local_y * BLOCK_SIZE + local_x]
the execution time is reduced by more than 50%. I've tested this behavior on Windows 7 64-bit (8500GT) and on my MacBook (9400M); in both cases the performance increase was significant.
original lines:
#define AS(i, j) As[i + j * BLOCK_SIZE]
#define BS(i, j) Bs[i + j * BLOCK_SIZE]
...
AS(ty, tx) = A[a + dim * ty + tx];
BS(ty, tx) = B[b + dim * ty + tx];
// Synchronize to make sure the matrices are loaded
barrier(CLK_LOCAL_MEM_FENCE);
for (int k = 0; k < BLOCK_SIZE; ++k)
Csub += AS(ty, k) * BS(k, tx);
...
modified lines:
AS(tx, ty) = A[a + dim * ty + tx];
BS(tx, ty) = B[b + dim * ty + tx];
// Synchronize to make sure the matrices are loaded
barrier(CLK_LOCAL_MEM_FENCE);
for (int k = 0; k < BLOCK_SIZE; ++k)
Csub += AS(k, ty) * BS(tx, k);
…
Does Nvidia have any reason for writing data from global to shared memory this way?
You don't need coalesced access in local memory, because it's on-chip. However, you do need to avoid bank conflicts. If BLOCK_SIZE is 16, then accessing with
As[local_x * BLOCK_SIZE + local_y]
causes many bank conflicts, because As[0] and As[16] share the same bank (as do As[1] and As[17], As[2] and As[18], etc.). Accessing with
As[local_y * BLOCK_SIZE + local_x]
lets each thread read from a different bank, so there are no bank conflicts.
One way to avoid bank conflicts in this case is to pad the array by one column, i.e. give it a row stride of BLOCK_SIZE + 1, so that successive rows start in different banks.