Regarding reduction, I wonder why the % operation is slow in CUDA.
When launching a kernel in CUDA with <<<blocks, threads>>>, I also wonder how many blocks and threads can be run without error. I went up to 2048 for the block and got an error, and I wonder whether that is outside the range of allowed values.
Because integer division is hard. In other words, the modulo operation with a variable divisor is slow on all compute platforms that I am familiar with (quite a few), including CUDA. A modulo computation with a constant divisor, i.e. one known at compile time, on the other hand can be optimized well by compilers for all common compute platforms, including CUDA.
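To make the distinction concrete, here is a minimal sketch in plain C++ (the same expressions can be used inside a CUDA kernel). The function names `mod_runtime`, `mod_const`, and `mod_pow2` are made up for illustration and are not part of any CUDA API. The bitmask variant is a separate, commonly used workaround that I am adding here as an assumption about your use case: it only applies when the divisor is a power of two, which is often true of block sizes in reduction code.

```cpp
#include <cassert>
#include <cstdint>

// Variable divisor: n is only known at run time, so the compiler has to
// emit a general-purpose division/remainder sequence.
inline std::uint32_t mod_runtime(std::uint32_t x, std::uint32_t n) {
    return x % n;
}

// Constant divisor: N is a template parameter and therefore known at
// compile time, which lets the compiler replace the division with
// cheaper multiply/shift/mask instructions.
template <std::uint32_t N>
inline std::uint32_t mod_const(std::uint32_t x) {
    static_assert(N != 0, "divisor must be nonzero");
    return x % N;
}

// Run-time divisor that is guaranteed to be a power of two:
// x % n is then equal to x & (n - 1), which avoids division entirely.
// (Hypothetical helper; valid ONLY for power-of-two n.)
inline std::uint32_t mod_pow2(std::uint32_t x, std::uint32_t n) {
    assert(n != 0 && (n & (n - 1u)) == 0);  // precondition: power of two
    return x & (n - 1u);
}
```

All three return the same result whenever their preconditions hold, for example `mod_runtime(13, 8)`, `mod_const<8>(13)`, and `mod_pow2(13, 8)` each give 5. Only the first one forces a full division at run time.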