I am parallelizing a mathematical method which involves reduction of the absolute values of a float array, so I employed one of the solutions proposed by Mark Harris and adapted to my problem.

If you are also starting from scratch, like me, you can read the entire conversation at this thread: https://devtalk.nvidia.com/default/topic/1038617/cuda-programming-and-performance/understanding-and-adjusting-mark-harriss-array-reduction/

Go to txbob’s last answer, copy and paste the code and you have a working bare-bones normal reduction (not absolute, which we will discuss below).

=== BEGIN OF JOURNAL POST ===

My particular case has float arrays as input and output, dimension defined by user, input array populated with samples in the range [-1:+1]. If you already have your favorite function implemented, look for the line that loads the sample into the device’s shared memory:

```
sdata[tid] += array_in[i];
```

You will be tempted to just do an “abs(array_in[i])”. It totally broke the performance on my current setup. Instead of calling abs() regardless of the sample, I replaced with a ternary operator conditional:

```
array_in[i] >= 0 ? sdata[tid] += array_in[i] : sdata[tid] += array_in[i] * -1;
```

I thought the conditional and a multiplication by -1 in the case of a negative value would still impact the performance against the normal reduction, but it was not the case. Running the normal and absolute reductions multiple times on a 250 million elements array took an average 0,006s on a 1080Ti.

To confirm the correctness of the reduction, a single-threaded CPU reduction was done (discussed in txbob’s answer):

```
float cpu_sum = 0.0;
for(size_t i = 0; i < array_length; i++)
cpu_sum += abs(array_in[i]);
cout << cpu_sum << endl;
```

Due to floating-point arithmetics and precision, the CPU and GPU provide the same result up to the 5th digit, and after that you begin to see rounding differences but the bulk of the reduction is good enough, specially for a huge array. And it is exactly here that extra care is needed. Depending on the range of your samples, the accumulator will overflow, sooner or later.

For the range [-1:+1] and the way my data is generated, an array of 35M elements will already overflow the CPU float accumulator, stopping at 16777216. HOWEVER, I was puzzled to see the device’s float (which is also a 4 bytes type) taking much bigger array length until it begins to overflow. In fact, I stopped testing at 500M elements and it was still printing increased values. But the decimal part was long gone, so MAYBE, and someone more qualified can confirm or deny, at an overflow situation the device somehow gives up the bits for the mantissa and just represents the number as an integer?!

Regardless, you might want to change your accumulator (that is, your shared memory storage) from float to double, so your declaration in the reduction looks like:

```
extern volatile __shared__ double sdata[];
```

But then, the reduction stores the final value in the first element of the input array. So you are left with the problem of having a double accumulator but a float final storage… will you change everything from float to double to accomodate this, and reduce the amount of work you can compute at a time due to the now twice as big storage? Just a few things for you to consider if you have a similar problem to solve.

The other thing to be aware of is: the reduction overwrites the first element of the input array. If you, just like me, have to preserve it in order to run other computations, store it in a temporary variable in the device, so the program doesn’t have to copy the whole array from device to host memory because of 1 element. The function that launches reduction kernels and other computations account for this, so find it below (including my reduction launching parameters) and change to what you need:

```
bool big_computation(void)
{
const unsigned int grid_Size = 256, block_Size = 128;
float *reduction_tmp; // Stores the final reduction and the first input element
__CUDA_CALL(cudaMallocManaged(&reduction_tmp, 2 * sizeof(float)));
__CUDA_CALL(cudaMemset(reduction_tmp, 0, 2 * sizeof(float)));
__CUDA_CALL(cudaDeviceSynchronize()); // Needed due to allocation with cudaMallocManaged
// Copy the first element of input array to second element of temp array, pointer arithmetics
__CUDA_CALL(cudaMemcpy(reduction_tmp + 1, array_in, sizeof(float), cudaMemcpyDeviceToDevice));
__CUDA_CALL(cudaDeviceSynchronize());
// Runs the reduction
cuda_absreduction_1D < block_Size > <<< grid_Size, block_Size, block_Size * sizeof(float) >>> (array_in, array_out, array_length);
cuda_absreduction_1D < block_Size > <<< 1, block_Size, block_Size * sizeof(float) >>> (array_out, array_in, grid_Size);
__CUDA_CALL(cudaDeviceSynchronize());
// Copy things back to their places before running other computations
// Final reduction from array_in[0] to reduction_tmp[0]
__CUDA_CALL(cudaMemcpy(reduction_tmp, array_in, sizeof(float), cudaMemcpyDeviceToDevice));
__CUDA_CALL(cudaDeviceSynchronize());
// First sample from reduction_tmp[1] to array_in[0]
__CUDA_CALL(cudaMemcpy(array_in, reduction_tmp + 1, sizeof(float), cudaMemcpyDeviceToDevice));
__CUDA_CALL(cudaDeviceSynchronize());
// Here I call other functions that depend on the result of the reduction
// Finished, deallocate the temp array
__CUDA_CALL(cudaFree(reduction_tmp));
return true;
}
```

__CUDA_CALL is that macro that we see in countless CUDA tutorials that does error checking on CUDA commands. This one returns false at any error, and if all went fine, my function returns true. On the thread mentioned above, check the reply by Njuffa in the end, he shares a macro to check for errors after launching a kernel function (which __CUDA_CALL can’t help with).

I’m still thinking of how to overcome the overflow problem, but this post was more to make you aware of how an absolute reduction of floats requires extra care. I will drink something now.