Strange way of writing data from global to shared memory in SDK sample oclMatrixMul

I just had a closer look at the matrix multiplication sample (oclMatrixMul) in Nvidia's SDK. Now I'm wondering about how data is written from global to local memory, because it is written like this:

As[local_x * BLOCK_SIZE + local_y]

Correct me if I'm wrong, but in this case we are writing in an uncoalesced way. I read somewhere that this doesn't matter for shared memory, but if I change the code above to:

As[local_y * BLOCK_SIZE + local_x]

The execution time is reduced by more than 50 %. I’ve tested this behavior on Windows 7 64-bit (8500GT) and on my MacBook (9400M).

In both cases the performance increase was significant.
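For anyone who wants to reproduce the measurement, here is a minimal sketch of timing one kernel launch with OpenCL event profiling (the time_kernel helper and its arguments are placeholders, not part of the SDK sample; the command queue has to be created with CL_QUEUE_PROFILING_ENABLE):

#include <CL/cl.h>
#include <stdio.h>

/* Minimal sketch: time one kernel launch via event profiling.
   Assumes 'queue' was created with CL_QUEUE_PROFILING_ENABLE and that
   'kernel', 'global' and 'local' are already set up elsewhere. */
void time_kernel(cl_command_queue queue, cl_kernel kernel,
                 const size_t global[2], const size_t local[2])
{
    cl_event evt;
    cl_ulong t_start, t_end;

    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(t_end),   &t_end,   NULL);

    printf("kernel time: %.3f ms\n", (t_end - t_start) * 1e-6);
    clReleaseEvent(evt);
}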

Original lines:

#define AS(i, j) As[i + j * BLOCK_SIZE]

#define BS(i, j) Bs[i + j * BLOCK_SIZE]

...

AS(ty, tx) = A[a + dim * ty + tx];

BS(ty, tx) = B[b + dim * ty + tx];

// Synchronize to make sure the matrices are loaded

barrier(CLK_LOCAL_MEM_FENCE);

for (int k = 0; k < BLOCK_SIZE; ++k)

	 Csub += AS(ty, k) * BS(k, tx);

...

Modified lines:

AS(tx, ty) = A[a + dim * ty + tx];

BS(tx, ty) = B[b + dim * ty + tx];

// Synchronize to make sure the matrices are loaded

barrier(CLK_LOCAL_MEM_FENCE);

for (int k = 0; k < BLOCK_SIZE; ++k)

	Csub += AS(k, ty) * BS(tx, k);

…
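For context, both versions sit inside a kernel roughly like the sketch below. This is a simplified square-matrix version written for illustration, not the exact SDK source: tx and ty are the local IDs, dim is the matrix width, and the local tiles are declared inside the kernel here for simplicity.

#define BLOCK_SIZE 16
#define AS(i, j) As[i + j * BLOCK_SIZE]
#define BS(i, j) Bs[i + j * BLOCK_SIZE]

// Simplified square-matrix version of the tiled multiplication, for illustration only.
__kernel void matrixMul(__global float* C,
                        __global const float* A,
                        __global const float* B,
                        int dim)
{
    __local float As[BLOCK_SIZE * BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE * BLOCK_SIZE];

    int tx = get_local_id(0);
    int ty = get_local_id(1);
    int bx = get_group_id(0);
    int by = get_group_id(1);

    float Csub = 0.0f;

    // Walk over all tile pairs that contribute to this output block.
    for (int a = dim * BLOCK_SIZE * by, b = BLOCK_SIZE * bx;
         a < dim * BLOCK_SIZE * by + dim;
         a += BLOCK_SIZE, b += BLOCK_SIZE * dim)
    {
        AS(ty, tx) = A[a + dim * ty + tx];   // original layout (swap the macro arguments for the modified version)
        BS(ty, tx) = B[b + dim * ty + tx];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty, k) * BS(k, tx);
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    C[dim * (BLOCK_SIZE * by + ty) + BLOCK_SIZE * bx + tx] = Csub;
}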

Does Nvidia have any reason for writing data from global to shared memory in this way?

Accesses to local memory don't need to be coalesced, because it's on-chip. However, you need to avoid bank conflicts. If BLOCK_SIZE is 16, then accessing with

As[local_x * BLOCK_SIZE + local_y]

causes many bank conflicts because As[0] and As[16] share the same bank (also As[1] and As[17], As[2] and As[18], etc.). Accessing with

As[local_y * BLOCK_SIZE + local_x]

lets each thread read from a different bank, so there are no bank conflicts.

One way to avoid bank conflicts with the first (column-style) access pattern is to pad each row of the array by one element, i.e.

As[local_x * (BLOCK_SIZE + 1) + local_y]
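As a sketch of what that padding could look like inside a kernel (the padded_tile_demo kernel below is illustrative only, not from the SDK): with one extra element per row the row stride becomes BLOCK_SIZE + 1 = 17, so the tile needs BLOCK_SIZE * (BLOCK_SIZE + 1) floats.

#define BLOCK_SIZE 16

// Illustrative kernel showing the padded tile layout; not taken from the SDK.
__kernel void padded_tile_demo(__global float* out)
{
    // One extra element per row: the row stride becomes BLOCK_SIZE + 1 = 17.
    __local float As[BLOCK_SIZE * (BLOCK_SIZE + 1)];

    int local_x = get_local_id(0);
    int local_y = get_local_id(1);

    // Row-style write: consecutive local_x -> consecutive addresses -> 16 different banks.
    As[local_y * (BLOCK_SIZE + 1) + local_x] = (float)(local_x + local_y);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Column-style read: index = local_x * 17 + local_y, so
    // bank = index % 16 = (local_x + local_y) % 16, which is different for
    // every local_x in a half-warp -> no bank conflicts either.
    out[get_global_id(1) * get_global_size(0) + get_global_id(0)] =
        As[local_x * (BLOCK_SIZE + 1) + local_y];
}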

Thanks for your answer. Just to be clear, aren't

As[local_y * BLOCK_SIZE + local_x]

and

As[local_x * (BLOCK_SIZE + 1) + local_y]

the same (BLOCK_SIZE = 16)?

In the first case we write to As[0], As[1], As[2] and so on, depending on local_x.
In the second case we write to As[0], As[17], As[34], As[51], etc.

So we write to local memory bank:
0 for As[0]
1 for As[17] as well as for As[1]
2 for As[34] as well as for As[2]
3 for As[51] as well as for As[3]
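The same mapping can be reproduced with a few lines of host code, assuming 16 banks so that bank = index % 16:

#include <stdio.h>

#define BLOCK_SIZE 16
#define NUM_BANKS  16   /* assuming 16 banks on the GPUs discussed here */

int main(void)
{
    /* Print the bank for the first few indices of both patterns (local_y = 0). */
    for (int local_x = 0; local_x < 4; ++local_x) {
        int idx_plain  = local_x;                      /* As[local_y * 16 + local_x] */
        int idx_padded = local_x * (BLOCK_SIZE + 1);   /* As[local_x * 17 + local_y] */
        printf("local_x=%d: As[%2d] -> bank %2d, As[%2d] -> bank %2d\n",
               local_x, idx_plain, idx_plain % NUM_BANKS,
               idx_padded, idx_padded % NUM_BANKS);
    }
    return 0;
}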

I'm not sure if that's right, so correct me if I'm wrong.

Yes, you are right. If you only need to access the array in one way, you don't need to pad it, because

As[local_y * BLOCK_SIZE + local_x]

is fine. However, if you need to access the array in both row-major and column-major order, e.g. you want to do both

As[local_y * BLOCK_SIZE + local_x]

and

As[local_x * BLOCK_SIZE + local_y]

then you'll have to pad the array to make sure that there are no bank conflicts in either situation.
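To make that concrete, here is a small host-side sketch (purely illustrative, assuming 16 banks and a 16-thread half-warp with a fixed local_y) that prints which bank each thread hits for the different access patterns:

#include <stdio.h>

#define BLOCK_SIZE 16
#define NUM_BANKS  16

/* Print which bank each of the 16 threads of a half-warp hits for a given
   indexing scheme (local_y fixed, local_x = thread id within the half-warp). */
static void show_banks(const char* label, int stride, int column_style)
{
    printf("%-28s", label);
    for (int local_x = 0; local_x < 16; ++local_x) {
        int local_y = 5;  /* arbitrary fixed row for the half-warp */
        int idx = column_style ? local_x * stride + local_y   /* column-major style */
                               : local_y * stride + local_x;  /* row-major style    */
        printf(" %2d", idx % NUM_BANKS);
    }
    printf("\n");
}

int main(void)
{
    show_banks("row access, stride 16:",    BLOCK_SIZE,     0); /* all banks distinct */
    show_banks("column access, stride 16:", BLOCK_SIZE,     1); /* all hit the same bank: 16-way conflict */
    show_banks("column access, stride 17:", BLOCK_SIZE + 1, 1); /* padded: all banks distinct again */
    show_banks("row access, stride 17:",    BLOCK_SIZE + 1, 0); /* padded row access is still conflict-free */
    return 0;
}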

OK, I think I got it.

If the kernel contains something like:

for (int i = 0; i < BLOCK_SIZE; i++)

	tmp += A_loc[i + BLOCK_SIZE * local_y];

then shared memory would be accessed in column-major order, and hence bank conflicts would occur.