There’s not much information out there on warp vote functions, either in the CUDA Programming Guide or via a Google search.
So, some questions on warp vote functions:
Can the predicate evaluate to different values for different threads in a warp? An example predicate would be (arr1[threadIdx.x] == arr2[threadIdx.x]).
Do they internally cause the warp to diverge if the predicate evaluates differently for different threads in the warp?
Is there any example code or an algorithm that could benefit from warp vote functions?
What is the throughput of warp vote functions, __all and __any?
Even some pointers to this information would be much appreciated.
The vote operations have been useful to me when one thread needs to load some largish data from global memory. The whole warp can then help load, say, 256 bytes of data into shared memory instead of having one thread issue many, many memory transactions. (In one example case, the vote is on whether any thread needs to look at a voxel’s worth of polygon data.)
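A minimal sketch of that cooperative-load pattern, in case it helps. This is not the actual voxel code: the names coopLoadKernel, runCoopLoad, need, bigData and the 64-word tile size are all made up for illustration, and it assumes one warp per block. It also uses __any_sync and __syncwarp, the CUDA 9+ spellings, rather than the older __any mentioned in this thread.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kWarpSize  = 32;
constexpr int kTileWords = 64;   // 64 ints * 4 bytes = 256 bytes (illustrative size)

// Assumes a launch with exactly one warp (32 threads) per block.
// need[i] != 0 means lane i wants to look at the big data block.
__global__ void coopLoadKernel(const int* need, const int* bigData, int* out)
{
    __shared__ int tile[kTileWords];
    const unsigned fullMask = 0xffffffffu;
    const int lane  = threadIdx.x;
    const int wants = need[lane];

    // Warp vote: every lane gets the same answer, non-zero if ANY lane wants the data.
    if (__any_sync(fullMask, wants)) {
        // The branch is uniform across the warp, so all 32 lanes help:
        // each lane copies 2 of the 64 words into shared memory.
        for (int i = lane; i < kTileWords; i += kWarpSize)
            tile[i] = bigData[i];
        __syncwarp(fullMask);    // make the tile visible to the whole warp
    }

    // Only the lanes that asked for the data actually consume it.
    int sum = 0;
    if (wants) {
        for (int i = 0; i < kTileWords; ++i)
            sum += tile[i];
    }
    out[lane] = sum;
}

// Hypothetical host helper: runs the kernel for a 32-entry need[] pattern
// and returns the per-lane results. Error checking omitted for brevity.
std::vector<int> runCoopLoad(const std::vector<int>& need)
{
    std::vector<int> data(kTileWords), out(kWarpSize, -1);
    for (int i = 0; i < kTileWords; ++i) data[i] = i + 1;   // 1 + 2 + ... + 64 = 2080

    int *dNeed = nullptr, *dData = nullptr, *dOut = nullptr;
    cudaMalloc(&dNeed, kWarpSize  * sizeof(int));
    cudaMalloc(&dData, kTileWords * sizeof(int));
    cudaMalloc(&dOut,  kWarpSize  * sizeof(int));
    cudaMemcpy(dNeed, need.data(), kWarpSize  * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dData, data.data(), kTileWords * sizeof(int), cudaMemcpyHostToDevice);

    coopLoadKernel<<<1, kWarpSize>>>(dNeed, dData, dOut);
    cudaMemcpy(out.data(), dOut, kWarpSize * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dNeed); cudaFree(dData); cudaFree(dOut);
    return out;
}
```

Calling runCoopLoad with only one entry of need set should still have all 32 lanes take part in the load, while only that one lane sums the tile.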
Throughput is basically one op, just like a simple add. Can’t be any faster!
Will this be the case even if the threads in a warp evaluate the given predicate differently?
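For what it’s worth, here is a small sketch of exactly that case, using the predicate from the first post. Each lane computes its own value of the predicate and the vote combines them; nothing in the kernel branches on the predicate itself, so there is no divergent code path in it. The names voteKernel, runVote and VoteResult are hypothetical, it assumes a single warp, and it uses __all_sync and __any_sync (the CUDA 9+ replacements for __all and __any).

```cuda
#include <cuda_runtime.h>

constexpr int kLanes = 32;

struct VoteResult { int all; int any; };

// Assumes a launch with a single warp of 32 threads.
__global__ void voteKernel(const int* arr1, const int* arr2, int* result)
{
    const unsigned fullMask = 0xffffffffu;

    // Per-lane predicate: may be true in some lanes and false in others.
    int pred = (arr1[threadIdx.x] == arr2[threadIdx.x]);

    // Both votes return the same value in every lane of the warp.
    int allEq = __all_sync(fullMask, pred);   // non-zero only if pred holds in all lanes
    int anyEq = __any_sync(fullMask, pred);   // non-zero if pred holds in at least one lane

    if (threadIdx.x == 0) {
        result[0] = (allEq != 0);
        result[1] = (anyEq != 0);
    }
}

// Hypothetical host helper: builds two 32-element arrays that differ in the
// first `mismatches` positions and returns the two vote results.
// Error checking omitted for brevity.
VoteResult runVote(int mismatches)
{
    int h1[kLanes], h2[kLanes], hRes[2] = { -1, -1 };
    for (int i = 0; i < kLanes; ++i) {
        h1[i] = i;
        h2[i] = (i < mismatches) ? -1 : i;    // -1 never equals a lane index
    }

    int *d1 = nullptr, *d2 = nullptr, *dRes = nullptr;
    cudaMalloc(&d1, sizeof(h1));
    cudaMalloc(&d2, sizeof(h2));
    cudaMalloc(&dRes, sizeof(hRes));
    cudaMemcpy(d1, h1, sizeof(h1), cudaMemcpyHostToDevice);
    cudaMemcpy(d2, h2, sizeof(h2), cudaMemcpyHostToDevice);

    voteKernel<<<1, kLanes>>>(d1, d2, dRes);
    cudaMemcpy(hRes, dRes, sizeof(hRes), cudaMemcpyDeviceToHost);

    cudaFree(d1); cudaFree(d2); cudaFree(dRes);
    return VoteResult{ hRes[0], hRes[1] };
}
```

With no mismatches both votes come back true; with a single mismatching element the "all" vote is false and the "any" vote is still true; with every element mismatching both are false.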