__syncthreads and shared memory

Hey.

My question is quite simple: shared memory is "shared" among all threads within a specific block.
However, whenever I use __syncthreads, for example after loading my shared memory:


__global__ void kernel(int *ptr)
{
    extern __shared__ int sharedMemory[];   // size is set by the third launch parameter
    sharedMemory[threadIdx.x] = 1;
    __syncthreads();
}

Let's say that ptr's size is, for example, 1024 integers, and main executed the following command:
kernel<<<1024/256, 256, 256 * sizeof(int)>>>(ARRAY);
which generates 4 blocks, where each block contains 256 threads.
Since I have 4 blocks, I assume I'm going to have 4 sharedMemory arrays, one for each block.
My question is: after all calculation is done, is there any way to combine all those sharedMemory variables (4 in number) into one?
I guess CUDA global memory has something to do with it, but I'm not sure it's the best way, or the fastest.

thanks for your help!

I guess you can use the atomicAdd() function. There is no synchronization between blocks, only between threads in the same block. You can use the function like this:

atomicAdd(&result, sharedMemory[threadIdx.x]);

Plus, you can use a "reduction" algorithm within each block and collect the per-block results into one via atomicAdd.
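For example, something along these lines (just a rough sketch: it assumes blockDim.x is a power of two and that *result was zeroed before the launch; the kernel name and arguments are made up for illustration):

__global__ void reduceKernel(const int *ptr, int *result)
{
    extern __shared__ int sharedMemory[];

    // Each thread loads one element into shared memory.
    sharedMemory[threadIdx.x] = ptr[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            sharedMemory[threadIdx.x] += sharedMemory[threadIdx.x + stride];
        __syncthreads();
    }

    // One thread per block adds the block's partial sum to the global result.
    if (threadIdx.x == 0)
        atomicAdd(result, sharedMemory[0]);
}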

Hey guys, can you give me some tips regarding how to avoid branch divergence?
Seems like this subject keeps haunting me.
All I know is that it's somehow connected to warps. I do understand that whenever we execute a kernel, we launch it as <<<number of blocks, threads per block>>>.
When the SM decides how to run our threads, it splits a block's threads into half-warps and lets the 16 cores run them simultaneously, as long as all cores run the same instruction. Whenever a core gets into a branch, most likely not all cores will follow the same instruction, and therefore the branches will be executed one by one. So far so good.

The question is: how can I avoid those branches? Because every program that I've seen so far contains IF and LOOP statements inside the kernel.
I've seen that we can use a "stride" or something like that... but that subject still isn't clear to me.

thanks for your help!

You don’t need to avoid branches at all. Just try to avoid conditional branches where some threads of a warp take the branch and others don’t.

if statements = conditional branches

And I can say the same thing about loop statements, can't I?

Yes.

So as a CPU programmer, how can I avoid those barriers?
I find it very difficult to come up with ways to avoid them.
For example:

if (threadIdx.x % 2 == 0)
    sum += 1;
else
    sum += 2;

It's easy to see that this code is going to reduce my performance because branch divergence was created.
What's the alternative way to do this kind of calculation?

thanks.

sum += 1 + threadIdx.x % 2;

Actually,

sum += 1 + (threadIdx.x & 1);

would be a lot faster, but the compiler should be able to transform the former into the latter.

Hi,

IF and LOOP are not an issue if you handle them correctly.

Suppose that you have 64 threads per block. This means that you have 2 warps per block.

Now suppose that you have something like this:

if (warpID == 0) {
    // do this
} else {
    // do that
}

This code will not serialize anything, because each thread within a warp takes the same path.

You can read about this in the Programming Guide 3.2, sections 4.1 and 5.4.2.
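Note that warpID above is not a built-in variable; it could be derived, for example, like this (an illustrative kernel, assuming warpID = threadIdx.x / warpSize):

__global__ void perWarpWork(float *data)
{
    int warpID = threadIdx.x / warpSize;  // warpSize is built-in (32 on current hardware)
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (warpID == 0) {
        data[i] *= 2.0f;  // all 32 threads of warp 0 take this path together
    } else {
        data[i] += 1.0f;  // every other warp takes this path, so no warp diverges
    }
}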

Wow, thank you both. I didn't know the developer can control warp management.

Hardware-wise this would be incorrect, but programming-wise it's probably correct.

The shared memory is physically shared across all blocks resident on a multiprocessor. However, I guess this is only for the allocation of it. Syntactically and execution-wise it might be separate... at least that's what the guide says... "visible to all threads in a thread block"... So I can see how the "hardware capabilities/concepts" are conflicting with the "programming concepts".

Also, the last thing I heard is that __syncthreads() only works within a block, not across all blocks or all warps, so you might want to check into that.

There also seem to be __threadfence() synchronization functions which wait until a thread's updates are visible to all other threads.

The driver API does have a context function, cuCtxSynchronize(), which can wait on the entire context, and this could be used on the host to wait for everything to complete.
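With the runtime API, the usual pattern looks roughly like this (a minimal host-side sketch; d_result and reduceKernel refer to the illustrative reduction posted earlier in the thread):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int numBlocks = 4, threadsPerBlock = 256;
    int h_result = 0;
    int *d_result;

    cudaMalloc((void **)&d_result, sizeof(int));
    cudaMemcpy(d_result, &h_result, sizeof(int), cudaMemcpyHostToDevice);

    // Launch the earlier sketch; the third launch parameter is the
    // dynamic shared-memory size per block.
    // reduceKernel<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(d_ptr, d_result);

    cudaDeviceSynchronize();  // host blocks until all launched work finishes

    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_result);
    cudaFree(d_result);
    return 0;
}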

Yeah, that is interesting: warpID is not mentioned under the "built-in" variables... so I am gonna check if I can find it somewhere else in the guide mentioned above... ;) :)

The guide does not mention "warpID", though I can vaguely remember something about it...

Anyway, perhaps he means warpID = threadIdx.x / warpSize;

Bye,

Skybuck.

This is a pretty good question, I shall re-formulate it a little bit:

How does one access the shared memory? For example, from the host?

My first guess would be that it's impossible: the shared memory is somewhere on the device, and it cannot be accessed from the host or copied with a memcpy function.

I could be wrong though.

So my first guess would be the kernel has to copy it to global memory/a pointer.

I will look into it though, to see if I can perhaps find some driver API which might be able to copy from shared memory.

Nope, it doesn't seem possible. You could try acquiring a pointer to the shared memory and then performing a copy from that pointer location to the host, but I would guess that wouldn't work.

The shared memory is probably on a special chip or something.

Perhaps CUDA 5.0 or so will offer a memory copy function to copy directly from/to shared memory from the host ;)

You will find B.2.3 of the programming guide interesting.

It shows how shared memory uses the same indexing in every block... so all threads in a block start from the same base. This simplifies the copying to main memory somewhat, but not really, I guess... it's more or less the same ;) one base versus a base for each thread or block or kernel... but it's probably easier if it's just one base ;)

But which thread is going to do the copy, huh? ;)

I guess with multiple threads the copy could also proceed in parallel ;)

Hmm, now that I read it again, it mentions it's only for all threads in the block... so not for all blocks... since you have multiple blocks, each block will still have to do its own part of the copy, I guess ;)
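To make that concrete, a rough sketch (the names stageOut and d_output are made up): every thread writes its own shared-memory element to a per-block slice of a global buffer, so the copy itself runs in parallel and each block handles only its own part. The host can then cudaMemcpy the buffer back.

__global__ void stageOut(int *d_output)
{
    extern __shared__ int sharedMemory[];

    sharedMemory[threadIdx.x] = threadIdx.x;  // ...fill shared memory somehow...
    __syncthreads();

    // Each thread copies its own element into this block's slice.
    d_output[blockIdx.x * blockDim.x + threadIdx.x] = sharedMemory[threadIdx.x];
}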

Hi,

Maybe I was unclear when specifying the warpID. It was just an illustration of how to use it.

An easy way to keep track of the warpID is to launch 32 threads in the x dimension and some number of threads in the y dimension. Then one can use threadIdx.y as the warpID.
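A minimal sketch of that launch configuration (the kernel name and the work it does are made up; it assumes warpSize is 32, so with blockDim.x == 32 each y-row of threads is exactly one warp):

__global__ void rowPerWarp(float *data)
{
    int warpID = threadIdx.y;  // one warp per row of 32 threads
    int tid = (blockIdx.x * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;

    if (warpID == 0) {
        data[tid] *= 2.0f;  // warp 0: all 32 threads take this path together
    } else {
        data[tid] += 1.0f;  // every other warp stays uniform too
    }
}

// Launched e.g. as: rowPerWarp<<<numBlocks, dim3(32, 8)>>>(d_data);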
