beginner question regarding shared memory

sandoz31 · November 13, 2009, 6:24pm

Hello :)

I’ve started optimizing my a cuda kernel, and i’d like to move some small arrays and larger types (like float4’s etc…) onto the shared memory instead of on my stack.
I’ve read the cuda programming manual but it’s quite vague on how to allocate propperly with avoiding bank conflicts.

Basically, my question is very simple, how can i place a variable onto shared memory unique to my thread, nothing needs to be shared between thread in my warp etc…

eg, take float a[8];

how would i put allocate shared memory so my threads can store the a[8] arrays on shared memory, without any bank conflicts ?
I don’t fully understand the cuda manual,
if i’d do:

shared float a[8];

i’m assuming it will allocate an array of 8 floats on shared memory, but all threads in my warp will then read/write to the same values or not ? will they be unique to my thread or not ?

Thanks,
sando

avidday · November 15, 2009, 3:42pm

Shared memory is allocated and shared at a block level. If you want 8 floats per thread in shared memory which are not common to all threads in a block, then you would have to allocate 8 times the number of threads per block and then index into those using the thread index. That might have some negative implications for occupancy, depending the on register usage of your kernel and how many threads per block you choose.

sandoz31 · November 15, 2009, 4:26pm

Hi,

Thanks for the reply,

but i don’t understand how to avoid the bankconflicts the manual talks about.

eg,

define THREADWIDTH 4

define THREADHEIGHT 16

…

dim3 block(THREADWIDTH, THREADHEIGHT, 1);

so in my kernel i’d allocate:

shared float[THREADWIDTH][THREADHEIGHT][8];

and in my current kernel i’d use for example:

{

const int thx = threadIdx.x;

const int thy = threadIdx.y;

// Simple store and load

[thx][thy][0] = 1.f;

float t = [thx][thy][0];

}

Would that cause bankconflicts ?

And if yeah, what to do ?

I don’t see how the bank stuff comes into the picture.

Thanks,

Sando

Jimmy_Pettersson · November 15, 2009, 5:03pm

You’ve probably already read this in the PG but:

There are 16 memory banks, each bank is 4 bytes wide. This means that if you write to shared memory the first 32 bytes will be written into unique memory banks and thus working at register speed. When you get to bytes 33-36 however you will be writing back into memory bank #1 again meaning that these two writes will be serialized one after the other.

In general this isn’t a problem since you for each SM you only have one warp ( 32 threads ) being executed at a time. Now if each thread is writing 4 bytes this means that thread #1 and thread #17 will both be writing to bank #1 BUT these are serialized after one another anyways so it isn’t much of a bother. Since physically there are only 8 SP’s doing the work of 32 threads in 4 “fast” clock cycles.

One example of when this can be an issue is if you’ve allocated

shared float shared_matrix[16][16]

and want to write to just one column at a time. This is typical when for example doing a matrix transpose. You will then be writing all you’re values into the same memory bank. Fix ?

do for ex

shared float shared_matrix[16][16 + 1]

This adds an additional column that means that element shared_matrix[0][0] and shared_matrix[1][0] are no longer in the same memory bank.

Another typical problem would be when you are using doubles and also float2 ( quit sure… ) datatypes…

When reading writing these 8 byte types thread #1 and thread #9 will be in the same memory bank.

Hope this was in any way helpful.

sandoz31 · November 16, 2009, 11:07pm

You’ve probably already read this in the PG but:

There are 16 memory banks, each bank is 4 bytes wide. This means that if you write to shared memory the first 32 bytes will be written into unique memory banks and thus working at register speed. When you get to bytes 33-36 however you will be writing back into memory bank #1 again meaning that these two writes will be serialized one after the other.

In general this isn’t a problem since you for each SM you only have one warp ( 32 threads ) being executed at a time. Now if each thread is writing 4 bytes this means that thread #1 and thread #17 will both be writing to bank #1 BUT these are serialized after one another anyways so it isn’t much of a bother. Since physically there are only 8 SP’s doing the work of 32 threads in 4 “fast” clock cycles.

One example of when this can be an issue is if you’ve allocated

shared float shared_matrix[16][16]

and want to write to just one column at a time. This is typical when for example doing a matrix transpose. You will then be writing all you’re values into the same memory bank. Fix ?

do for ex

shared float shared_matrix[16][16 + 1]

This adds an additional column that means that element shared_matrix[0][0] and shared_matrix[1][0] are no longer in the same memory bank.

Another typical problem would be when you are using doubles and also float2 ( quit sure… ) datatypes…

When reading writing these 8 byte types thread #1 and thread #9 will be in the same memory bank.

Hope this was in any way helpful.

thanks a lot, that explains a lot ;)

Sando