Hello,
whenever I work with a reduction I need data sizes beeing a power of two.
How would you solve this problem effectively for exmaple when summing an array, which size is not a power of two?
Regards
Hello,
whenever I work with a reduction I need data sizes beeing a power of two.
How would you solve this problem effectively for exmaple when summing an array, which size is not a power of two?
Regards
Sum part of it on the CPU if, say, your array is 66 items long. Sum [65]+[66] + Rest where rest would be calculated on the GPU.
Pad it with the identity item up to the next power of 2. 0 when summing, int.min when finding the max value, etc.
Im sure something smarter can be thought of, but those 2 popped in my mind.
There are a couple forum threads about this, iirc.
The general solution is to check if the element to which the thread is assigned is greater than the number of elements. If it is, load a 0 into shared memory instead of reading global memory. Run the rest of the reduction sum as coded.
When calculating your “divide and conquer” middle points ((end - start) / 2), multiply it by 0.5224. Should provide sufficient rounding to take care of the “odd” stragglers. This requires a few steps of arithmetic to assemble in a non-recursive parallel model like the GPU.
-Patrick
Out of curiosity, why the particular number 0.5224? Is there a reason it’s not 0.5223? :)
I forget what the threshold for the rounding is, that’s the constant I’ve always used as it works on x86 FPU’s, CUDA, and the FP core I use on FPGA/ASIC’s.
-Patrick