Throughput of warp vote functions?

There’s not much information out there on warp vote functions, either in the CUDA Programming Guide or via a Google search…

So, some questions on warp vote functions:

  1. Is it possible that the predicate to be evaluated is different for different threads in a warp? An example predicate would be (arr1[threadIdx.x] == arr2[threadIdx.x]).
  2. Will they internally cause the warp to diverge if the predicate evaluates differently for different threads in the warp?
  3. Is there any example code OR an algorithm which could benefit from warp vote functions?
  4. What is the throughput of warp vote functions, __all and __any?

Even some pointers to this information would be much appreciated…


Throughput is basically one op, just like a simple add. Can’t be any faster!
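
To make the first two questions concrete, here is a rough sketch (mine, not from the programming guide). Each thread evaluates its own predicate on its own data, and the vote is a single instruction that every lane executes, so the vote itself contains no branch on the predicate. Note that the sketch uses __any_sync and __all_sync, the CUDA 9+ replacements for the __any and __all intrinsics discussed in this thread. The kernel name voteOnEquality, the host helper runVote, and the single 32-thread block are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;

// Each lane compares its own pair of elements; the vote combines the 32 results.
// Assumes a launch with exactly one full warp (32 threads).
__global__ void voteOnEquality(const int* arr1, const int* arr2,
                               int* anyEqual, int* allEqual)
{
    const unsigned fullMask = 0xffffffffu;      // all 32 lanes participate
    int pred = (arr1[threadIdx.x] == arr2[threadIdx.x]);  // differs per lane
    int anyVote = __any_sync(fullMask, pred);   // nonzero if any lane's pred is nonzero
    int allVote = __all_sync(fullMask, pred);   // nonzero only if every lane's pred is nonzero
    if (threadIdx.x == 0) {                     // every lane holds the same result
        *anyEqual = (anyVote != 0);
        *allEqual = (allVote != 0);
    }
}

// Hypothetical host helper: arr2 differs from arr1 in the first `mismatches` lanes.
void runVote(int mismatches, int* anyEqual, int* allEqual)
{
    int h1[kWarpSize], h2[kWarpSize];
    for (int i = 0; i < kWarpSize; ++i) {
        h1[i] = i;
        h2[i] = (i < mismatches) ? -1 : i;      // -1 never matches a lane index
    }
    int *d1, *d2, *dFlags;
    cudaMalloc(&d1, sizeof(h1));
    cudaMalloc(&d2, sizeof(h2));
    cudaMalloc(&dFlags, 2 * sizeof(int));
    cudaMemcpy(d1, h1, sizeof(h1), cudaMemcpyHostToDevice);
    cudaMemcpy(d2, h2, sizeof(h2), cudaMemcpyHostToDevice);

    voteOnEquality<<<1, kWarpSize>>>(d1, d2, dFlags, dFlags + 1);

    int hFlags[2] = {0, 0};
    cudaMemcpy(hFlags, dFlags, sizeof(hFlags), cudaMemcpyDeviceToHost);  // waits for the kernel
    *anyEqual = hFlags[0];
    *allEqual = hFlags[1];
    cudaFree(d1);
    cudaFree(d2);
    cudaFree(dFlags);
}
```

With five mismatching lanes, for example, the "any" vote is still true while the "all" vote is false, and every lane in the warp sees the same pair of answers.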

The vote operations have been useful to me when a thread needs to load some largish data from global memory: the whole warp can then help load, say, 256 bytes of data into shared memory instead of having one thread issue many, many memory transactions. (In one example case, the vote is on whether any thread needs to look at a voxel’s worth of polygon data.)
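
A minimal sketch of that whole-warp loader pattern, under stated assumptions: it is not the actual voxel code, it uses the CUDA 9+ __any_sync and __syncwarp intrinsics rather than the older __any, and the names warpLoadIfAny and runWarpLoad, the 64-float (256-byte) chunk, and the single-warp launch are all made up for illustration.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpSize   = 32;
constexpr int kChunkWords = 64;   // 64 floats = 256 bytes

// If any lane needs the chunk, the whole warp stages it into shared memory
// cooperatively; only the lanes that asked for it then read it.
// Assumes a launch with exactly one full warp (32 threads).
__global__ void warpLoadIfAny(const float* chunk, const int* needs, float* out)
{
    __shared__ float tile[kChunkWords];
    const unsigned fullMask = 0xffffffffu;
    const int lane = threadIdx.x;
    const int want = needs[lane];               // per-lane predicate

    if (__any_sync(fullMask, want)) {           // same answer in every lane
        for (int i = lane; i < kChunkWords; i += kWarpSize)
            tile[i] = chunk[i];                 // consecutive lanes read consecutive words
        __syncwarp();                           // make the tile visible to the whole warp
    }

    float sum = 0.0f;
    if (want) {                                 // only the requesting lanes use the data
        for (int i = 0; i < kChunkWords; ++i)
            sum += tile[i];
    }
    out[lane] = sum;
}

// Hypothetical host helper: only lane `needyLane` requests the chunk (pass -1 for none).
// The chunk is filled with 1.0f; the return value is the sum of all lanes' outputs.
float runWarpLoad(int needyLane)
{
    float hChunk[kChunkWords];
    int hNeeds[kWarpSize];
    for (int i = 0; i < kChunkWords; ++i) hChunk[i] = 1.0f;
    for (int i = 0; i < kWarpSize; ++i)   hNeeds[i] = (i == needyLane);

    float *dChunk, *dOut;
    int *dNeeds;
    cudaMalloc(&dChunk, sizeof(hChunk));
    cudaMalloc(&dNeeds, sizeof(hNeeds));
    cudaMalloc(&dOut, kWarpSize * sizeof(float));
    cudaMemcpy(dChunk, hChunk, sizeof(hChunk), cudaMemcpyHostToDevice);
    cudaMemcpy(dNeeds, hNeeds, sizeof(hNeeds), cudaMemcpyHostToDevice);

    warpLoadIfAny<<<1, kWarpSize>>>(dChunk, dNeeds, dOut);

    float hOut[kWarpSize];
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dChunk);
    cudaFree(dNeeds);
    cudaFree(dOut);

    float total = 0.0f;
    for (int i = 0; i < kWarpSize; ++i) total += hOut[i];
    return total;
}
```

The point of the vote here is that the branch around the load is taken by all lanes or by none, so the 256 bytes arrive through a couple of coalesced warp-wide reads instead of 64 separate reads issued by the one thread that wanted them.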

Really appreciate your response…

“Throughput is basically one op, just like a simple add. Can’t be any faster!”

Will this be the case even if the threads in a warp evaluate the given predicate differently?

It’s worth noting that there is an equivalent to “vote.any.pred” documented/discovered here:

If your target GPU is at Compute Capability 1.0/1.1 then this can be invaluable in implementing whole-warp loaders like SPWorley describes.


Very cool!
Thanks, allan, for sharing the link… This link also has some pretty nice algorithms!

I was reading Appendix G4 Compute Capability 2.0 in the CUDA Programming Guide Version 3.0 and noticed that page 148 states:

So that Aggregate “GPU Any” technique should still work on Compute Capability 2.0 devices.