The left side shows a regular broadcast; the right side is what I would like to do.
“x” is just some garbage elements.
Is it possible?
Or is there an efficient alternative?
Shared memory offers a broadcast mechanism, but not a selective one.
You have to implement it yourself.
Constant memory can also be very fast when all threads are reading the same value.
But I guess in your case you won’t be reading in multiples of 32, so this might not be the solution for you.
Hmm, this is just a quick idea I got… Maybe it could be fast to let ONE warp do 3 broadcasts that overlap each other? Some pseudocode:
__shared__ float array[256];
__shared__ float vals[6*8]; // Your result (added a little extra length)
....
int offset = 256/3; // right ?
// broadcast one
vals[threadIdx.x] = array[0];
// broadcast two
vals[threadIdx.x+ 8] = array[offset];
// broadcast three
vals[threadIdx.x + 16] = array[2*offset];
I guess this might be a fast way to do it. You might have to think about the indexing more than I did, but you get the basic idea?
Shared memory requests are per half-warp, consisting of 16 threads. There are no conflicts whatsoever between threads coming from different half-warps, and broadcast is done per half-warp too.
I think what Nikolai wants to do is:
vals[threadIdx.x]=array[threadIdx.x & ~0x7];
This read is serviced using the broadcast mechanism, but it takes two memory requests per half-warp (as if we had a 2-way bank conflict). A 2-way bank conflict is not much.
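For anyone who wants to try it, here is a minimal sketch of that indexing wrapped in a complete test program. The kernel name selectiveBroadcast, the single block of 256 threads, and the group size of 8 (implied by the ~0x7 mask) are my assumptions for illustration, not something taken from the original question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every group of 8 consecutive threads reads the first
// element of its own group, i.e. thread t reads array[t & ~0x7].
__global__ void selectiveBroadcast(const float* in, float* out)
{
    __shared__ float array[256];
    __shared__ float vals[256];

    // Stage the input in shared memory, one element per thread.
    array[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    // Threads 0-7 read array[0], threads 8-15 read array[8], and so on.
    vals[threadIdx.x] = array[threadIdx.x & ~0x7];

    // Each thread reads back only the slot it wrote, so no second barrier is needed.
    out[threadIdx.x] = vals[threadIdx.x];
}

int main()
{
    const int n = 256;  // assumed: one block of 256 threads
    float hIn[n], hOut[n];
    for (int i = 0; i < n; ++i) hIn[i] = static_cast<float>(i);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    selectiveBroadcast<<<1, n>>>(dIn, dOut);
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against the same index mapping computed on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != hIn[i & ~0x7]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(dIn);
    cudaFree(dOut);
    return mismatches != 0;
}
```

This only checks that the values come out right; it says nothing about how many requests the hardware actually issues. For that you would need to time it or look at a profiler on your own card.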
Jimmy’s version explicitly introduces two transactions, which does not help.
My statement is motivated by what can be read in the Programming Guide:
Following this, I understand that in our case the first half-warp can be serviced, for example, as follows:
Request 1: array[0] as broadcast and array[8] ordinarily (one thread only). Threads 0-8 are done.
Request 2: array[8] as broadcast, threads 9-15 are done.
Reading your replies gave me an idea that circumvented the whole need for the “selective broadcast mechanism”.
Thanks, guys!