How to carry out the sum operation in CUDA Fortran?

For a large array, it’s fairly easy to implement the sum operation in CUDA C using pointers. I’m wondering how to perform this operation efficiently on the GPU in CUDA Fortran.

Hi Bullish,

Can you please explain a bit more or post an example of what you mean by “sum operation”?

Sum reductions are quite difficult to perform efficiently in parallel, but no more so in Fortran than in C. I wrote a basic one for an article (See: ), but it is by no means optimal. NVIDIA has a good slide deck on reductions (See: ) that helps explain the details.

  • Mat

Hi Mat,
First, thank you for your reply. The sum operation I mentioned is exactly the intrinsic function sum() in Fortran. I tried to rewrite sum() in CUDA Fortran, and the GPU code is much slower than the CPU version. As far as I know, CUDA Fortran doesn’t support direct memory address operations, so the GPU’s capability is hard to fully exploit even with the partial sum trick. Have you encountered this problem?

Hi Bullish,

Using the sum intrinsic from within a device kernel would be very slow, since every thread would perform the whole sum itself and would need to read the device’s global memory to do so. I would advise against using the reduction intrinsics in a device kernel unless you are reducing a small local or shared array.

To perform reductions efficiently, you should follow the partial reduction examples described earlier. Note that sum reductions on a GPU are not expected to be faster than on the CPU. Rather, they should only be done on the GPU if the cost of transferring the data back to the host is greater than the cost of the reduction.
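
As a rough illustration of the partial reduction idea, here is a minimal sketch, not an optimized implementation. The names reduce_mod, partial_sum, BLOCK_SIZE and nblocks are made up for this example, and the block size of 256 and the 64 blocks are assumptions rather than tuned values. Each thread first accumulates a private sum, each block then combines its threads’ sums in shared memory, and the host adds up the one-value-per-block results.

```fortran
module reduce_mod
  use cudafor
  implicit none
  ! Assumed block size; the tree reduction below needs a power of two.
  integer, parameter :: BLOCK_SIZE = 256
contains
  ! Each block writes the sum of its share of a(1:n) to partial(blockIdx%x).
  attributes(global) subroutine partial_sum(a, partial, n)
    integer, value :: n
    real :: a(n), partial(*)
    real, shared :: s(BLOCK_SIZE)
    integer :: tid, i, stride
    real :: mySum

    tid = threadIdx%x
    mySum = 0.0
    ! Grid-stride loop: each thread builds a private sum first.
    i = (blockIdx%x - 1) * blockDim%x + tid
    do while (i <= n)
      mySum = mySum + a(i)
      i = i + blockDim%x * gridDim%x
    end do
    s(tid) = mySum
    call syncthreads()

    ! Tree reduction in shared memory (indices are 1-based).
    stride = blockDim%x / 2
    do while (stride >= 1)
      if (tid <= stride) s(tid) = s(tid) + s(tid + stride)
      call syncthreads()
      stride = stride / 2
    end do

    if (tid == 1) partial(blockIdx%x) = s(1)
  end subroutine partial_sum
end module reduce_mod

program test_reduce
  use cudafor
  use reduce_mod
  implicit none
  integer, parameter :: n = 1000000, nblocks = 64   ! assumed sizes
  real, allocatable :: a(:)
  real, device, allocatable :: a_d(:), partial_d(:)
  real :: partial(nblocks)

  allocate(a(n), a_d(n), partial_d(nblocks))
  a = 1.0
  a_d = a                                  ! host-to-device copy
  call partial_sum<<<nblocks, BLOCK_SIZE>>>(a_d, partial_d, n)
  partial = partial_d                      ! copy back one value per block
  print *, 'GPU sum:', sum(partial), ' CPU sum:', sum(a)
  deallocate(a, a_d, partial_d)
end program test_reduce
```

The final sum over the per-block results is done on the host here because only nblocks values need to be copied back. If the total is needed on the device, a second launch with a single block over partial_d would do the same job.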

Note that as of the 10.5 release, the PGI accelerator model is able to use CUDA Fortran device data. This allows you to use the PGI accelerator’s highly optimized reductions within CUDA Fortran. For example, add the following to your host code and then compile with “-ta=nvidia”.

!$acc region
  sumVal = sum(devArr)
!$acc end region

As for your question about direct memory address (DMA) operations, again I’m not clear as to what you mean. DMA has to do with how data is transferred to and from the CPU and GPU. Do you mean pinned memory (which is supported in CUDA Fortran)?
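
In case pinned memory is what you are after, a minimal sketch looks something like the following. The program name, array names and size are made up for illustration. The pinned attribute asks for page-locked host memory, and the optional PINNED= specifier on allocate reports whether the request succeeded.

```fortran
program pinned_demo
  use cudafor
  implicit none
  integer, parameter :: n = 1000000        ! assumed array size
  real, pinned, allocatable :: a(:)        ! page-locked host array
  real, device, allocatable :: a_d(:)      ! device array
  logical :: isPinned
  integer :: istat

  ! PINNED= is set to .false. if the runtime had to use pageable memory.
  allocate(a(n), stat=istat, pinned=isPinned)
  if (istat /= 0) stop 'host allocation failed'
  if (.not. isPinned) print *, 'Warning: array is not pinned'

  allocate(a_d(n))
  a = 1.0
  a_d = a          ! host-to-device transfer from pinned memory
  a = a_d          ! device-to-host transfer
  deallocate(a, a_d)
end program pinned_demo
```

Pinned memory only affects the transfer cost between host and device. It does not change how the reduction itself runs on the GPU.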

  • Mat