Why is TimeStep = 0 in my do concurrent reduce min code?

(1) Code running on CPU
!$acc update self(ElmGTS)
TimeStep = MinVal(ElmGTS)
!$acc update device(TimeStep)

(2) Code running on GPU
TimeStep = ElmGTS(1)
Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
TimeStep = Min(TimeStep, ElmGTS(i))
End Do
!$acc update self(TimeStep)

In code (1), !$acc update self(ElmGTS) transfers ElmGTS from device to host, and I then compute the minimum of the array ElmGTS on the host.
However, I want to avoid transferring ElmGTS, as it is very large, so I wrote code (2). There I calculate TimeStep on the GPU, so there is no need to transfer the ElmGTS array from device to host. I compile code (2) with HPC SDK 23.3.
As I understand it, HPC SDK 22.11 and later support the Reduce(Min : var) clause, which finds the minimum value of an array on the GPU and so avoids transferring ElmGTS from device to host.
However, code (2) does not obtain the right value of TimeStep; TimeStep is always 0.0. Does anyone know what the problem is?
Thank you very much.

Can you please provide a reproducing example of the failing case? I tried to reproduce it here, but mine works fine.

% cat testcon.F90
program main
   integer :: i, NumElms
   real :: TimeStep
   real, dimension(:), allocatable :: ElmGTS
   NumElms = 1024
   allocate(ElmGTS(NumElms))

   call RANDOM_NUMBER(ElmGTS)

   TimeStep = ElmGTS(1)
!$acc enter data copyin(TimeStep)
   Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
     TimeStep = Min(TimeStep, ElmGTS(i))
   End Do
!$acc update self(TimeStep)
!$acc exit data delete(TimeStep)
   print *, TimeStep
   deallocate(ElmGTS)

end program main
% nvfortran -stdpar=gpu testcon.F90; a.out
   1.7807492E-03

Note that you don’t need to use the OpenACC directives for TimeStep, the reduction will take care of the data movement.
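For instance, a minimal variant of the test above with no OpenACC directives at all (a sketch, assuming the same `nvfortran -stdpar=gpu` build) might look like:

```fortran
program main
   ! Sketch: same reduction test as above, but without any OpenACC
   ! directives -- the Reduce clause handles the data movement of
   ! the reduction variable TimeStep itself.
   implicit none
   integer :: i, NumElms
   real :: TimeStep
   real, dimension(:), allocatable :: ElmGTS

   NumElms = 1024
   allocate(ElmGTS(NumElms))
   call RANDOM_NUMBER(ElmGTS)

   TimeStep = ElmGTS(1)
   Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
      TimeStep = Min(TimeStep, ElmGTS(i))
   End Do

   print *, TimeStep        ! should match MinVal(ElmGTS)
   deallocate(ElmGTS)
end program main
```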

I just found my problem, but thank you all the same. Your code is good.

(2) Code running on GPU
TimeStep = ElmGTS(1)
Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
TimeStep = Min(TimeStep, ElmGTS(i))
End Do
!$acc update self(TimeStep)

In my program, ElmGTS is updated in a loop. Before the loop, I use !$acc enter data copyin(ElmGTS).

Then, inside the loop, it is updated by the do concurrent construct, and there is no !$acc update self(ElmGTS) after each recalculation.

That is, only the device copy of ElmGTS is updated. The host copy is still the initial ElmGTS, which was initialized to 0. So TimeStep = ElmGTS(1) reads 0 on the host, and the minimum is therefore always 0.
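A minimal sketch of this pitfall (assuming ElmGTS has distinct host and device copies, i.e. no CUDA managed memory): only the device copy is updated, so the host read of ElmGTS(1) sees the stale zero unless at least that element is refreshed first:

```fortran
program stale_host
   ! Sketch of the stale-host-copy pitfall described above.
   implicit none
   integer :: i
   integer, parameter :: NumElms = 1024
   real(kind=8) :: TimeStep
   real(kind=8) :: ElmGTS(NumElms)

   ElmGTS = 0.0d0                    ! host copy initialized to zero
!$acc enter data copyin(ElmGTS)

   ! Updates the device copy only; the host copy stays all zeros.
   Do Concurrent(i = 1:NumElms)
      ElmGTS(i) = 1.0d-2 + 1.0d-5 * Real(i, kind=8)
   End Do

   ! Wrong: reads the stale host copy, so TimeStep starts at 0.0.
   TimeStep = ElmGTS(1)

   ! Fix: refresh just the one element before reading it ...
!$acc update self(ElmGTS(1:1))
   TimeStep = ElmGTS(1)

   ! ... then reduce over the rest on the device as before.
   Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
      TimeStep = Min(TimeStep, ElmGTS(i))
   End Do

!$acc exit data delete(ElmGTS)
   print *, TimeStep
end program stale_host
```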

I found another problem in this code.

!$acc update self(ElmGTS)
TimeStep = MinVal(ElmGTS)
!$acc update device(TimeStep)
Write(6,*)TimeStep

When I compile the code with VS on Windows, TimeStep is always near 1.0E-02, more or less, and the values in ElmGTS are also nearly 1.0E-02. The program succeeds. I think this shows that the algorithm in the code is right. Do you agree?

However, when I compile the code with nvfortran on Ubuntu, TimeStep becomes smaller and smaller, down to 1.0E-08, and the program fails: the step is so small that CurTime = CurTime + TimeStep is almost unchanged, so CurTime can never reach EndTime.

I think the nvfortran build is not finding the minimum value of the ElmGTS array. Is that right?

Do you know why this happens and how to deal with it?

Do you know why this happens and how to deal with it?

Without a reproducing example, I can only offer guesses and general advice.

What exactly are you comparing? The exact same source on both platforms, or the CPU version against the GPU-enabled version?

If it’s the same source with both running on the CPU, then it could be a difference in optimization. Your code may be numerically sensitive and thus need strict compliance with IEEE 754. In that case, try adding the flag “-Kieee”. This disables some optimizations that may reorder operations and give slightly different results.

You can also consider using double precision if your code currently uses single precision; this will improve the accuracy.

If it’s the GPU version, then it could be any number of things.

Parallelization reorders operations, which in turn affects accuracy, since the rounding error can differ. There’s not much that can be done here, especially with reductions.

There could be a race condition in the code which is affecting the results.

It could be a data synchronization issue between the device and the host.

I have two versions of the code. Both calculate the minimum value of ElmGTS and assign it to TimeStep. I want to find out which one is valid and better.

--------------------------------------(1) Code
    !$acc update self(ElmGTS)
    TimeStep = MinVal(ElmGTS)
    !$acc update device(TimeStep)
--------------------------------------(2) Code
    !$acc update self(ElmGTS(1))
    TimeStep = ElmGTS(1)
    !$acc update device(TimeStep)
    Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
       TimeStep = Min(TimeStep, ElmGTS(i))
    End Do
    !$acc update self(TimeStep)

TimeStep and ElmGTS are both Real(kind=8), that is, double precision.
Following your advice, I used the -Kieee flag, and it helps to some degree.
For code (1), when this flag is used, TimeStep stays at about 1.0E-02. Without it, TimeStep becomes smaller and smaller.

For code (2), whether or not the flag is used, TimeStep becomes smaller and smaller.

I want to use code (2), because only the single value ElmGTS(1) is transferred from device to host.
In code (1), the whole array ElmGTS is transferred. In my program, ElmGTS has three million elements and is Real(kind=8), so I want to avoid transferring this array as much as possible.
Could you give me some advice?

As for why you’re seeing TimeStep become smaller and smaller: I would really need to see the entire code and be able to reproduce the problem. There’s too much missing information here to know what’s wrong.

I will say that you can simplify code #2, since the first element of ElmGTS does not need to be copied back. Just set TimeStep to a HUGE value and iterate from the first element.

TimeStep = HUGE(0.0)
Do Concurrent(i = 1:NumElms) Reduce(Min:TimeStep)
   TimeStep = Min(TimeStep, ElmGTS(i))
End Do
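One small detail worth noting: since TimeStep is declared Real(kind=8) elsewhere in this thread, a kind-matched seed is slightly cleaner, because HUGE(0.0) returns the largest single-precision value. A sketch of the kind-matched variant:

```fortran
! Sketch: seed the reduction with a HUGE of the same kind as TimeStep.
TimeStep = HUGE(TimeStep)
Do Concurrent(i = 1:NumElms) Reduce(Min:TimeStep)
   TimeStep = Min(TimeStep, ElmGTS(i))
End Do
```

Either seed works here, since any HUGE value is far above a physical timestep; HUGE(TimeStep) just keeps the kinds consistent.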

The whole code is large; it includes about twenty *.F90 files. Following your advice above, I recompiled the code and found a strange problem.

TimeStep = HUGE(0.0)
Do Concurrent(i = 1:NumElms) Reduce(Min:TimeStep)
   TimeStep = Min(TimeStep, ElmGTS(i))
End Do

Sometimes when I start the program, TimeStep stays near 1.0E-02, but other times it becomes smaller and smaller. That is, the program is not stable.
Do you know why? What are the possible reasons?
Could it be that some variables are not initialized?

TimeStep = ElmGTS(1)
Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
TimeStep = Min(TimeStep, ElmGTS(i))
End Do
I think the code above might be equivalent to this:
Do Concurrent(i = 1:NumElms) Reduce(Min:TimeStep)
TimeStep = ElmGTS(i)
End Do
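For what it’s worth, the two forms above are not quite equivalent under the Fortran 2023 REDUCE locality rules: the incoming value of the reduction variable still participates in the final combination, so the direct-assignment form yields min(incoming TimeStep, minimum of ElmGTS). A small sketch of the difference (variable names assumed, reusing those from the thread):

```fortran
program reduce_forms
   ! Sketch: the two Reduce(Min:...) forms differ when the incoming
   ! value of the reduction variable is smaller than every array
   ! element (e.g. a stale 0.0 from an un-refreshed host copy).
   implicit none
   integer :: i
   integer, parameter :: NumElms = 8
   real(kind=8) :: ElmGTS(NumElms), tA, tB

   ElmGTS = [(1.0d-2 + 1.0d-3 * Real(i, kind=8), i = 1, NumElms)]

   ! Form A: seed from the first element, reduce over the rest.
   tA = ElmGTS(1)
   Do Concurrent(i = 2:NumElms) Reduce(Min:tA)
      tA = Min(tA, ElmGTS(i))
   End Do
   print *, tA        ! the true minimum of ElmGTS

   ! Form B: direct assignment.  The prior value of tB is still
   ! combined with the per-iteration values, so a stale tB = 0.0
   ! makes the result 0.0 rather than the array minimum.
   tB = 0.0d0
   Do Concurrent(i = 1:NumElms) Reduce(Min:tB)
      tB = ElmGTS(i)
   End Do
   print *, tB
end program reduce_forms
```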

No, I have no way of knowing without more details. Note that twenty files isn’t that big, so if you can share them, that would be great. If you can’t share them publicly, please send me a direct message by following the link to my profile and clicking the “message” button. You can then provide details on how to obtain your code.

Could it be that some variables are not initialized?

That’s one possibility. You can run Valgrind on the CPU version to see if it finds any uninitialized memory reads (UMRs).