Question about __threadfence(): a detailed description of this function is needed.

Hello everyone,

The NVIDIA documentation gives the following description of this function:

waits until all global and shared memory accesses made by the calling thread prior to
__threadfence() are visible to all threads in the device.

My question is: is it visible to all the threads in all the blocks?

It should be, because there is one more function, __threadfence_block(), whose description is:

waits until all global and shared memory accesses made by the calling thread prior to
__threadfence_block() are visible to all threads in the thread block.

One more question:

if (blockIdx.x == 0)
{
    if (threadIdx.x == 0)
        count = 1;      // count is in global memory
    __threadfence();    // executed by every thread of block 0
}

// Code common to all blocks
Index = count * 1;

Will all blocks (other than block zero) read the value of count as 1?

Thank you very much.

With love and regards
Praveen.

No. That’s the difference between __threadfence() and __threadfence_block().

__threadfence() waits until the calling thread’s own earlier writes have settled and are visible device-wide (in practice the hardware probably does this for all threads in a warp, but the guarantee is per thread).

__threadfence_block() gives the same guarantee, but only toward the threads in the calling thread’s block. Neither function makes any other thread wait.

Your later example having block 0 write some value guaranteed to be seen by all other blocks will not work. There is no kernel-wide block synchronization mechanism finer than a kernel launch.

In the example you give, it is almost guaranteed that everybody else, who has nothing to do, will rush ahead and get the old value. Only the threads in warp zero are affected by the __threadfence() … a __syncthreads() after the ‘if’ section could fix that within block zero, though.
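
As a rough illustration of the "kernel launch as the only grid-wide barrier" point, here is a minimal sketch that splits the question's example into two launches. The names set_count, use_count and run_two_phase are hypothetical, and error handling is kept to a minimum. It relies on the fact that kernels launched into the same stream run in order, so the second kernel sees the first kernel's global-memory writes.

```cuda
#include <cuda_runtime.h>
#include <vector>

__device__ int count = 0;   // global memory, as in the question

// Phase 1: one thread publishes the value.
__global__ void set_count()
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        count = 1;
}

// Phase 2: every block reads the value written by the previous launch.
__global__ void use_count(int *out)
{
    if (threadIdx.x == 0)
        out[blockIdx.x] = count * 1;
}

// Hypothetical host helper: returns true if every block read count == 1.
bool run_two_phase(int num_blocks)
{
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, num_blocks * sizeof(int)) != cudaSuccess)
        return false;

    set_count<<<1, 32>>>();
    // Same (default) stream: this launch starts only after set_count is done.
    use_count<<<num_blocks, 32>>>(d_out);

    std::vector<int> h_out(num_blocks, 0);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out,
                                 num_blocks * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess)
        return false;

    for (int v : h_out)
        if (v != 1)
            return false;
    return true;
}
```

Calling run_two_phase(64) from main() should return true on a working device. No __threadfence() is needed here, because the launch boundary already orders the write before the reads.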

If you read a little further in the documentation, there is an example showing how to use __threadfence() and why it is there in the first place. It solves the following problem: when you increment a global index into a global array after writing results to that array, the increment may become visible before the data has reached its destination. I guess this is a side effect of having several memory banks and controllers, not necessarily all equally busy.
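
The pattern described above can be sketched roughly as follows. This is a simplified version written for this thread, not the documentation's code: the names blocks_done, sum_blocks and sum_on_device are hypothetical, thread 0 of each block sums its slice serially to keep the sketch short, and CUDA error checks are omitted. The point is the order of operations: write the partial result, call __threadfence(), and only then bump the counter.

```cuda
#include <cuda_runtime.h>
#include <vector>

__device__ unsigned int blocks_done = 0;   // number of blocks that have published

__global__ void sum_blocks(const float *in, int n,
                           volatile float *partial, float *total)
{
    if (threadIdx.x != 0)
        return;                      // sketch: one worker thread per block

    // Sum this block's slice of the input.
    int begin = blockIdx.x * blockDim.x;
    int end = begin + blockDim.x;
    if (end > n) end = n;
    float s = 0.0f;
    for (int i = begin; i < end; ++i)
        s += in[i];

    partial[blockIdx.x] = s;

    // Make sure the partial sum is visible device-wide before the
    // counter increment below can be observed by another block.
    __threadfence();

    // atomicInc returns the old value; the last block sees gridDim.x - 1.
    unsigned int ticket = atomicInc(&blocks_done, gridDim.x);
    if (ticket == gridDim.x - 1) {
        float t = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; ++b)
            t += partial[b];         // volatile: read from memory, not a stale copy
        *total = t;
        blocks_done = 0;             // reset so the kernel can be launched again
    }
}

// Hypothetical host helper: sums a non-empty vector on the device.
float sum_on_device(const std::vector<float> &h_in)
{
    const int n = static_cast<int>(h_in.size());
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;

    float *d_in = nullptr, *d_partial = nullptr, *d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    sum_blocks<<<blocks, threads>>>(d_in, n, d_partial, d_total);

    float total = -1.0f;
    cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    return total;
}
```

Note that only the last block knows the total is complete. The other blocks have already moved on, which is why the total cannot safely be consumed by all blocks within the same launch.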

Hello,

Thanks to SPWorley and jma for their valuable replies.

waits until all global and shared memory accesses made by the calling thread prior to
__threadfence() are visible to all threads in the device.

As I understand it, "all threads in the device" here does not mean all the threads we created on the device; it only means the threads in that particular warp.

Now, about the example in the NVIDIA documentation that follows the description of __threadfence() (http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf, on page 117):

Whichever block is last to store its value in result (I am referring to the example in the documentation) computes the final sum by adding all the partial sums that each block computed and stored in result, which is in global memory.

Suppose I want to use that final sum afterwards. There is no way to do it, since no block synchronization is possible.

In my application the kernel does not end there; all blocks use that final sum for further computation.

So I can't proceed in the same kernel. I have to split my kernel
and launch multiple kernels.

Is that OK? Please correct me and suggest a better way to implement this.

Will launching multiple kernels add overhead compared with launching one kernel?

One more question, though it is not relevant to this thread:
I get the correct result when I run in device emulation mode, but a different result when I run on the device.
What could cause this behavior?
Sorry for asking in this same thread.

Thanks a lot.

With love and regards
Praveen

Yes, it does not mean that all threads will wait. They won’t! And that is not what NVIDIA wrote either.

It means that this thread will wait for those writes done by this thread to complete all the way down through buffers, queues and cache to main memory so that everybody else (who are not waiting for anything!) can see it. That is also what Nvidia wrote. Why we are all desperately trying to read something different into that passage is a matter of psychology and wishful thinking.

There is a thread about how to create a global thread barrier in the General CUDA GPU Computing Discussion forum:
http://forums.nvidia.com/index.php?showtopic=92819

Note that this is only possible in the special case where ‘number of blocks’ equals ‘number of multiprocessors’. The application will hang otherwise!

The overhead of a kernel relaunch is generally in the same ballpark as the suggested implementations of global_thread_barrier. The tricky part - if you want to do better - is arranging for something useful to do while your block is waiting for every other block to complete, rather than disruptively spinning on the same spot at an insane speed.

As for why your application runs correctly in emulation and not on the device, this will typically be because the “emulation” is bogus and runs everything sequentially, with no possibility of a race condition ever occurring, whereas the device really runs everything in parallel, with your race conditions popping up all over the place … Or it could be that neither gives a “correct” result, and the two instead give slightly different “incorrect” results, both limited by finite precision. This is due to differences in rounding modes.
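
To make the race-condition case concrete, here is a small sketch with a deliberately racy kernel next to an atomic one. The names count_racy, count_atomic and run_counter are hypothetical, and error checks are omitted. Run sequentially, one thread at a time, the racy version would give the full count; on a real device its result is unpredictable, while the atomic version always counts every thread.

```cuda
#include <cuda_runtime.h>

// Racy: every thread does an unsynchronized read-modify-write on *counter,
// so concurrent threads can overwrite each other's updates.
__global__ void count_racy(unsigned int *counter)
{
    *counter = *counter + 1;
}

// Safe: the read-modify-write is a single atomic operation.
__global__ void count_atomic(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Hypothetical host helper: launches one of the two kernels and
// returns the final counter value.
unsigned int run_counter(bool use_atomic, int blocks, int threads)
{
    unsigned int *d_counter = nullptr;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    if (use_atomic)
        count_atomic<<<blocks, threads>>>(d_counter);
    else
        count_racy<<<blocks, threads>>>(d_counter);

    unsigned int h_counter = 0;
    cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}
```

With 64 blocks of 128 threads, run_counter(true, 64, 128) returns 8192. The value returned by run_counter(false, 64, 128) is not defined and will usually be far smaller.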

Check out Vasily’s paper on global-barrier: http://www.eecs.berkeley.edu/Pubs/TechRpts…ECS-2008-49.pdf

I think this global barrier only works for “active blocks”, so you could launch your kernel such that the number of active blocks == total number of blocks… That might help. Anyway, read the section fully and try to understand what he is saying… Great engineering work!

Man, I can’t believe the 8800 GTX has 2 levels of caches… undocumented… I have been using the card for so long ;-)