The following code gives inconsistent results depending on whether it is executed on the GPU or the CPU:
#include <iostream>
#include <cmath>

int main() {
  const int N = 10;
  float array[N];
  for (int i = 0; i < N; ++i) {
    array[i] = HUGE_VALF; // <= infinity
  }

  float min = HUGE_VALF;
  #pragma acc parallel loop vector reduction(min:min) copyin(array)
  for (int i = 0; i < N; i++) {
    if (array[i] < min) min = array[i];
  }
  std::cout << min << std::endl;
}
This (correctly) returns inf when executed on the CPU but returns 3.40282e+38 (FLT_MAX) when executed on the GPU.
The bug is also present when swapping min with max and HUGE_VALF with -HUGE_VALF, and when using doubles instead of floats.
This was tested with versions 23.7 and 24.5 on the Ampere and Turing architectures. I wasn’t able to test with the 25.7 version of the SDK or on newer CUDA architectures; I hope it hasn’t already been corrected there.
The issue here is that the OpenACC standard specifies that the initial value for the private copy of a min reduction variable is the largest value of that type, i.e. FLT_MAX in this case.
See section 2.5.15: https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.4.pdf
I’ve submitted an RFE, TPR #37611, to see what, if anything, we can do to help with this case.
-Mat
Hello Mat, thank you for your answer and the pointer to the OpenACC standard.
I see your point but I will raise two counterarguments:
- as I see it, the largest representable value for floats is HUGE_VALF (positive infinity), not FLT_MAX,
- it doesn’t explain why the program behaves differently depending on where it is executed.
Anyways, thanks again for your answer and thank you for submitting a bug report.
Fair point, and engineering may change the initial partial reduction value to HUGE_VALF, though I don’t know what other considerations are needed.
For multicore, reductions are much simpler so no initial value is needed.