My question is quite simple: shared memory is “shared” among all threads within a specific block.
However, whenever I use __syncthreads, for example after loading my shared memory, it only synchronizes the threads of that one block.
Let's say that the array's size is, for example, 1024 integers, and main executed the following command: kernel<<<1024/256, 256>>>(ARRAY);
which generates 4 blocks, where each block contains 256 threads.
Since I have 4 blocks, I assume I'm going to have 4 shared memory arrays, one for each block.
My question is: after all the calculation is done, is there any way to combine those 4 shared memory variables into one?
I guess CUDA global memory has something for this, but I'm not sure it's the best way, and also not sure it's the fastest.
I guess you can use the atomicAdd() function. There is no synchronization between blocks, only between threads in the same block. You can use the function like this:
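A minimal sketch matching the 1024-integer, 4×256 launch from the question (the kernel name, d_result, and the reduction itself are my own, not from the original post):

__global__ void sumKernel(const int *in, int *result)
{
    __shared__ int sdata[256];                  // each block gets its own copy of this array
    int tid = threadIdx.x;

    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                            // the whole block has finished loading

    // tree reduction inside the block's shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // exactly one thread per block merges the block's partial sum into global memory
    if (tid == 0)
        atomicAdd(result, sdata[0]);
}

Zero *result first, then launch sumKernel<<<1024/256, 256>>>(ARRAY, d_result); the atomicAdd is what makes the combine across the 4 blocks safe.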
Hey guys, can you give me some tips regarding how to avoid branch divergence?
It seems like this subject keeps haunting me.
All I know is that it's somehow connected to warps. I do understand that whenever we execute a kernel we launch it as <<<number of blocks, threads per block>>>.
When the SM decides how to run our threads, it splits a block's threads into half-warps and lets the 16 cores run them simultaneously, as long
as all cores run the same instruction. Whenever a core reaches a branch, most likely not all cores will follow the same instruction, and therefore
the paths will be executed one by one. So far so good.
The question is: how can I avoid those branches? Because every program I've seen so far contains IF and LOOP statements inside the kernel.
I've seen that we can use a “stride” or something like that… but that subject still isn't clear to me.
So, as a CPU programmer, how can I avoid this divergence?
I find it very difficult to come up with ways to avoid it.
for example
if (threadIdx.x % 2 == 0)
    sum += 1;
else
    sum += 2;
It's easy to see that this code is going to reduce my performance because of the branch divergence it creates.
What's the alternative way for this kind of calculation?
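One branch-free rewrite of that example (my sketch, not from the thread): since the two paths differ only in the value being added, you can compute that value with arithmetic, so every thread in the warp executes the same instruction stream:

// even threads add 1, odd threads add 2, with no branch at all
sum += 1 + (threadIdx.x & 1);

As for the “stride” you've seen: that is probably the grid-stride loop; it doesn't remove branches, but because adjacent threads of a warp usually take the loop the same number of times, the loop condition itself rarely diverges.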
Hardware-wise this would be incorrect, but programming-wise it's probably correct.
The shared memory is physically shared across all blocks resident on a multiprocessor, but I guess that only matters for how it is allocated. Syntactically and execution-wise it is separate… at least that's what the guide says: “visible for all threads in a thread block”… So I can see how the hardware concepts conflict with the programming concepts.
Also, the last thing I heard is that __syncthreads() only synchronizes the threads within a single block, never across blocks or the whole grid, so you might want to check into that.
There are also the __threadfence() functions, which make a thread's memory writes visible to the other threads in a defined order.
The driver API has a context function, cuCtxSynchronize(), which synchronizes on the entire context; it can be used on the host to wait for everything to complete.
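In host code that looks roughly like this (a sketch; kernel and d_data are placeholder names, and I'm showing the runtime API's equivalent call):

// launch returns immediately; the blocks run asynchronously
kernel<<<4, 256>>>(d_data);
// block the host until every block of every pending kernel has finished
cudaDeviceSynchronize();       // driver-API equivalent: cuCtxSynchronize()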
Yeah, that is interesting: warpID is not mentioned under the “built-in” variables… so I'm going to check whether I can find it somewhere else in the guide, as mentioned above. :)
The guide does not mention “warpID”, though I can vaguely remember something about it…
Anyway, perhaps he means warpID = threadIdx.x / warpSize;
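If that's the idea, a minimal sketch of it inside a kernel (warpID and laneID are just local variables, not built-ins):

// derive warp and lane indices in a 1-D block
int warpID = threadIdx.x / warpSize;   // which warp of the block this thread belongs to
int laneID = threadIdx.x % warpSize;   // this thread's position within that warp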
This is a pretty good question; I shall re-formulate it a little bit:
How does one access the shared memory? For example from the host?
My first guess would be that it's impossible: the shared memory is somewhere on the device, and it cannot be accessed from the host or copied with a memcpy function.
I could be wrong though.
So my first guess would be that the kernel has to copy it to global memory / a pointer.
I will look into it though, to see if I can perhaps find some driver API which might be able to copy from shared memory.
Nope, it doesn't seem possible. You could try acquiring a pointer to the shared memory and then performing a copy from that pointer location to the host, but I would guess that wouldn't work.
The shared memory is probably on a special part of the chip or something.
Perhaps CUDA 5.0 or so will offer a memory copy function to copy directly from/to shared memory from the host.
You will find B.2.3 of the programming guide interesting.
It shows how shared memory uses the same indexing in every block… so all threads in a block start from the same base. This simplifies the copying to main memory somewhat, but not really, I guess… it's more or less the same: one base versus a base per thread, block, or kernel… but it's still probably easier if it's just one base.
But which thread is going to do the copy, huh?
I guess with multiple threads the copy could also proceed in parallel.
Hmm, now that I read it again, it mentions this is only for all threads in the block… so not for all blocks… since you have multiple blocks, each block will still have to do its own part of the copy, I guess.
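Putting that together, a minimal sketch of the “stage through global memory” idea (kernel name and sizes are mine): each block writes its shared array to its own slice of a global buffer, every thread copies its own element (so the copy is parallel and each block handles its own part), and the host reads the buffer back afterwards with cudaMemcpy:

__global__ void dumpShared(int *out)
{
    __shared__ int sdata[256];
    sdata[threadIdx.x] = 2 * threadIdx.x;   // stand-in for the real computation
    __syncthreads();                        // everyone's result is in shared memory

    // per-block slice of the global buffer, one element per thread
    out[blockIdx.x * blockDim.x + threadIdx.x] = sdata[threadIdx.x];
}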