I just had a closer look at the matrix multiplication sample (oclMatrixMul) in Nvidia's SDK. Now I'm wondering how data is written from global to local memory, because it is written like this:
As[local_id_x * BLOCK_SIZE + local_id_y]
Correct me if I'm wrong, but in this case we are writing in an uncoalesced way. I read somewhere that this doesn't matter for shared memory, but if I change the code above to:
As[local_y * BLOCK_SIZE + local_x]
the execution time is reduced by more than 50%. I've tested this behavior on Windows 7 64-bit (8500GT) and on my MacBook (9400M); in both cases the performance increase was significant.
original lines:
#define AS(i, j) As[i + j * BLOCK_SIZE]
#define BS(i, j) Bs[i + j * BLOCK_SIZE]
...
AS(ty, tx) = A[a + dim * ty + tx];
BS(ty, tx) = B[b + dim * ty + tx];
// Synchronize to make sure the matrices are loaded
barrier(CLK_LOCAL_MEM_FENCE);
for (int k = 0; k < BLOCK_SIZE; ++k)
Csub += AS(ty, k) * BS(k, tx);
...
modified lines:
AS(tx, ty) = A[a + dim * ty + tx];
BS(tx, ty) = B[b + dim * ty + tx];
// Synchronize to make sure the matrices are loaded
barrier(CLK_LOCAL_MEM_FENCE);
for (int k = 0; k < BLOCK_SIZE; ++k)
Csub += AS(k, ty) * BS(tx, k);
…
Does Nvidia have any reason for writing data from global to shared memory this way?
You don't need coalesced access in local memory, because it's on-chip. However, you do need to avoid bank conflicts. If BLOCK_SIZE is 16, then accessing with
As[local_x * BLOCK_SIZE + local_y]
causes many bank conflicts, because As[0] and As[16] share the same bank (as do As[1] and As[17], As[2] and As[18], etc.). Accessing with
As[local_y * BLOCK_SIZE + local_x]
lets each thread read from a different bank, so there are no bank conflicts.
One way to avoid bank conflicts in this case is to pad the array by one column, i.e. give it a row stride of BLOCK_SIZE + 1, so that successive rows start in different banks.