I’ve started optimizing a CUDA kernel, and I’d like to move some small arrays and larger types (like float4s, etc.) into shared memory instead of keeping them on my stack.
I’ve read the CUDA programming manual, but it’s quite vague on how to allocate properly while avoiding bank conflicts.
Basically, my question is very simple: how can I place a variable in shared memory so that it is unique to my thread? Nothing needs to be shared between the threads in my warp, etc.
E.g., take float a[8];
How would I allocate shared memory so that each of my threads can store its own a array there, without any bank conflicts?
I don’t fully understand the CUDA manual.
If I do:
__shared__ float a[8];
I’m assuming it will allocate an array of 8 floats in shared memory, but will all threads in my warp then read/write the same values, or will they be unique to each thread?
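To make the question concrete, here is a minimal sketch of the layout I think I’m after. As far as I can tell, a __shared__ variable is allocated once per block and seen by every thread in it, so a per-thread copy would need one slot per thread, indexed by threadIdx.x. The kernel name, the helper function, and the block size of 64 are made up for illustration, and I’m assuming a 1D block and 4-byte-wide banks.

```cuda
#include <cuda_runtime.h>

constexpr int BLOCK_SIZE = 64;  // hypothetical block size (1D block assumed)
constexpr int N = 8;            // number of floats each thread wants to keep

__global__ void perThreadShared(float *out)
{
    // One array for the whole block. Element i of thread t lives at a[i][t],
    // so for a fixed i the threads of a warp touch consecutive 32-bit words,
    // which fall into different banks.
    __shared__ float a[N][BLOCK_SIZE];

    const int t = threadIdx.x;

    // Each thread writes only its own column.
    for (int i = 0; i < N; ++i)
        a[i][t] = static_cast<float>(t * N + i);

    // Each thread reads back only its own column, so no __syncthreads()
    // is needed here.
    float sum = 0.0f;
    for (int i = 0; i < N; ++i)
        sum += a[i][t];

    out[blockIdx.x * BLOCK_SIZE + t] = sum;
}

// Hypothetical host helper: launches one block and checks every thread's sum.
bool runPerThreadSharedDemo()
{
    float h_out[BLOCK_SIZE];
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess)
        return false;

    perThreadShared<<<1, BLOCK_SIZE>>>(d_out);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess)
        return false;

    for (int t = 0; t < BLOCK_SIZE; ++t) {
        // Sum over i of (t*N + i) = N*N*t + N*(N-1)/2.
        const float expected = static_cast<float>(N * N * t + N * (N - 1) / 2);
        if (h_out[t] != expected)
            return false;
    }
    return true;
}
```

Calling runPerThreadSharedDemo() from main should return true if every thread really gets its own 8 floats. The transposed a[N][BLOCK_SIZE] layout is my guess at the conflict-free choice; a[BLOCK_SIZE][N] would put threads 8 words apart, which I’d expect to cause conflicts.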