I want all threads to access a single double-precision value in shared memory. Will a broadcast be done automatically, or do I have to do it manually? If so, how?
I am using a compute capability 1.3 device with CUDA 2.3.
There is currently no way for “all threads” to access a value stored in shared memory, because shared memory is per-multiprocessor and there is no coherence between multiprocessors. If the double value is known before the kernel launch and doesn’t change over the life of a single launch, use constant memory instead. It is cached, has an automatic broadcast mechanism, and is the fastest way to propagate constant values to all threads.
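For the constant-memory case, the usual pattern looks something like this (a minimal sketch, untested; the names `scale` and `scaleKernel` are illustrative, not from the original posts):

```cuda
// A per-launch double placed in constant memory, visible to every
// thread in the grid. Reads are cached and broadcast by the hardware.
__constant__ double scale;

__global__ void scaleKernel(double *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= scale;   // every thread reads the same constant value
}

// Host side, before the launch:
//   double h_scale = 2.5;
//   cudaMemcpyToSymbol(scale, &h_scale, sizeof(double));
//   scaleKernel<<<blocks, threads>>>(d_data, n);
```

Note that `cudaMemcpyToSymbol` must be called before the kernel launch; the value then stays fixed for the duration of that launch.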
“If all threads read from the same shared memory address then a broadcast mechanism is automatically invoked and serialization is avoided.” – quoted from http://www.ddj.com/hpc-high-performance-computing/208801731
My question is for a double precision value.
In the section you are quoting (somewhat out of context), “all threads” refers to all active threads in a warp, not every thread in a grid. If you are only worried about warp-level reads, then you shouldn’t have to do anything. The multiprocessor can, I believe, automatically do both 32-bit and 64-bit broadcasts out of shared memory. If you really are interested in all threads in a grid, my original reply stands: it cannot be done.
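To make the warp/block-level case concrete, here is a sketch (untested; `broadcastKernel` is an illustrative name) in which one thread per block stages a double in shared memory and every thread in the block then reads the same address, which the hardware services as a broadcast rather than serializing:

```cuda
// One thread loads a double into shared memory; after the barrier,
// all threads in the block read the same shared address. Same-address
// reads within a warp are broadcast, so there is no bank conflict.
__global__ void broadcastKernel(const double *src, double *out, int n)
{
    __shared__ double value;

    if (threadIdx.x == 0)
        value = src[blockIdx.x];   // one thread fills the shared slot
    __syncthreads();               // make it visible to the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = value;            // same-address read -> broadcast
}
```

The `__syncthreads()` barrier is required so that no thread reads `value` before thread 0 has written it; the broadcast itself needs no explicit code.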
Thanks a lot!!