I would upload the array to GPU memory and then let each thread evaluate a subset of the sums. On one hand, this task is perfectly parallel; on the other, it will be entirely constrained by the PCIe bus, so you will not get anywhere close to the peak performance your GPU is capable of.
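As a rough sketch of that approach in plain CUDA: the kernel below assumes, purely for illustration, that "the sums" are taken over consecutive fixed-length segments of the array; the question may define them differently. The names `segment_sums`, `segment_len` and `num_segments` are hypothetical, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// ASSUMPTION: each sum covers one contiguous segment of `segment_len` elements.
// Adapt the indexing if your sums are defined differently.
__global__ void segment_sums(const float* in, float* out,
                             int num_segments, int segment_len)
{
    // Grid-stride loop: every thread evaluates a subset of the sums.
    for (int s = blockIdx.x * blockDim.x + threadIdx.x;
         s < num_segments;
         s += gridDim.x * blockDim.x)
    {
        float acc = 0.0f;
        for (int i = 0; i < segment_len; ++i)
            acc += in[s * segment_len + i];
        out[s] = acc;
    }
}

int main()
{
    const int segment_len  = 4;     // hypothetical segment length
    const int num_segments = 1024;  // hypothetical number of sums
    const int n = segment_len * num_segments;

    std::vector<float> h_in(n, 1.0f);        // dummy input data
    std::vector<float> h_out(num_segments);

    float* d_in  = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, num_segments * sizeof(float));

    // Upload the array to GPU memory (this transfer crosses the PCIe bus).
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (num_segments + threads - 1) / threads;
    segment_sums<<<blocks, threads>>>(d_in, d_out, num_segments, segment_len);

    // Copy the sums back; this blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, num_segments * sizeof(float),
               cudaMemcpyDeviceToHost);

    printf("first sum = %f\n", h_out[0]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Each input element is read once and used in a single addition, so the two `cudaMemcpy` calls are likely to take longer than the kernel itself, which is the PCIe limitation mentioned above.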
Here’s some code to compute the result using Thrust. This implementation should be pretty quick, although there may be cleverer ways to accomplish the same goal.