I’m trying to find an easy way to sum a large array (of varying size) in CUDA, so far without success. I found a reduction example, but the code is very old and not that easy to work with. I just have a device pointer to float4 and its size, and I want the sum.

There is a CUDA parallel reduction sample code which should be useful. There is also an accompanying PDF; search for “Mark Harris parallel reduction”.

Finally, libraries like Thrust (and CUB) offer simple, convenient methods for reduction (search for thrust::reduce).

Reduction of a vector type (float4) immediately raises questions in my head about your exact intent, but that doesn’t seem to be central to the very general question you have asked.
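
For reference, the shared-memory technique from the reduction sample can be sketched in a few lines for a plain float array of arbitrary length. This is a minimal, unoptimized sketch and not the sample's actual code: the names sumKernel and deviceSum are made up for illustration, the block size is assumed to be a power of two, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Each thread accumulates a grid-strided slice of the input, the block
// combines those values in shared memory, and thread 0 adds the block's
// partial sum to the global result.
__global__ void sumKernel(const float* in, int n, float* out)
{
    extern __shared__ float partial[];

    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        acc += in[i];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory; assumes blockDim.x is a power of two.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(out, partial[0]);
}

// Hypothetical host wrapper: d_in points to n floats already on the device.
float deviceSum(const float* d_in, int n)
{
    const int threads = 256;                  // power of two, see kernel
    int blocks = (n + threads - 1) / threads;
    if (blocks > 1024) blocks = 1024;         // grid-stride loop covers the rest
    if (blocks < 1) blocks = 1;

    float* d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));
    cudaMemset(d_out, 0, sizeof(float));

    sumKernel<<<blocks, threads, threads * sizeof(float)>>>(d_in, n, d_out);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}
```

Because the block partials are combined with atomicAdd, the floating-point summation order (and therefore the last few bits of the result) can vary from run to run.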

And the error just says: no suitable constructor exists to convert from “int” to “float4”,
even though dW is basically a float4 *.

I’m using float4 for a quaternion neural network and I want to add L2 regularization, so I need to perform a large summation over dW, which is my weight matrix.

Why not just describe what you want in simple math?

For example:

I have an array of float4. I want a summation where the float4 result contains the result of each component, e.g. result.x = summation(element.x), and the same for .y, .z, .w

This is actually exactly what you described. The sum of float4 one = (a, b, c, d) and float4 two = (e, f, g, h) would be float4 three = (a+e, b+f, c+g, d+h).
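
A component-wise sum like this can be sketched with thrust::reduce. The failing call is not shown in the thread, so the following is an assumption: the “int” to “float4” error is what one would expect if the initial value of the reduction is an integer 0 (the Thrust documentation states that the form without an explicit initial value uses 0). Passing an explicit float4 initial value and a custom addition functor avoids that conversion. AddFloat4 and sumFloat4 are hypothetical names; dW is assumed to be a raw device pointer and n the number of float4 elements.

```cuda
#include <cstddef>
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

// Component-wise addition of two float4 values. float4 has no built-in
// operator+, so the reduction needs to be told how to add.
struct AddFloat4
{
    __host__ __device__
    float4 operator()(const float4& a, const float4& b) const
    {
        return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
    }
};

// Hypothetical helper: returns the float4 whose components are the sums of
// the corresponding components of dW[0..n-1].
float4 sumFloat4(const float4* dW, std::size_t n)
{
    // Wrap the raw device pointer so Thrust runs the reduction on the GPU.
    thrust::device_ptr<const float4> first(dW);

    // The initial value is an explicit float4, not the integer 0.
    return thrust::reduce(first, first + n,
                          make_float4(0.0f, 0.0f, 0.0f, 0.0f),
                          AddFloat4());
}
```

With this helper, float4 total = sumFloat4(dW, n); gives total.x = summation(element.x), and likewise for .y, .z and .w.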