Why is TimeStep = 0 in my do concurrent reduce min code?

(1) Code running on CPU
!$acc update self(ElmGTS)
TimeStep = MinVal(ElmGTS)
!$acc update device(TimeStep)

(2) Code running on GPU
TimeStep = ElmGTS(1)
Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
TimeStep = Min(TimeStep, ElmGTS(i))
End Do
!$acc update self(TimeStep)

In code (1), !$acc update self(ElmGTS) transfers ElmGTS from device to host, and I then compute the minimum of the array ElmGTS on the host.
However, I want to avoid transferring ElmGTS, as it is very large, so I wrote code (2). There I calculate TimeStep on the GPU, so there is no need to transfer the ElmGTS array from device to host. I compile code (2) with HPC SDK 23.3.
As I understand it, HPC SDK 22.11 and later support the Reduce(Min : var) clause, which finds the minimum value of an array on the GPU and so avoids transferring ElmGTS from device to host.
However, code (2) does not obtain the right value of TimeStep; TimeStep is always 0.0. Does anyone know what the problem is?
Thank you very much.

Can you please provide a reproducing example of the failing case? I tried to reproduce it here, but mine works fine.

% cat testcon.F90
program main
   integer :: i, NumElms
   real :: TimeStep
   real, dimension(:), allocatable :: ElmGTS
   NumElms = 1024
   allocate(ElmGTS(NumElms))

   call RANDOM_NUMBER(ElmGTS)

   TimeStep = ElmGTS(1)
!$acc enter data copyin(TimeStep)
   Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
     TimeStep = Min(TimeStep, ElmGTS(i))
   End Do
!$acc update self(TimeStep)
!$acc exit data delete(TimeStep)
   print *, TimeStep
   deallocate(ElmGTS)

end program main
% nvfortran -stdpar=gpu testcon.F90; a.out
   1.7807492E-03

Note that you don’t need to use the OpenACC directives for TimeStep, the reduction will take care of the data movement.
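For instance, a minimal variant of the test above with no OpenACC directives at all (a sketch, assuming the same `nvfortran -stdpar=gpu` build) might look like:

```fortran
program main
   ! Sketch: same reduction test as above, but without any OpenACC
   ! directives -- the Reduce clause handles the data movement of
   ! the reduction variable TimeStep itself.
   implicit none
   integer :: i, NumElms
   real :: TimeStep
   real, dimension(:), allocatable :: ElmGTS

   NumElms = 1024
   allocate(ElmGTS(NumElms))
   call RANDOM_NUMBER(ElmGTS)

   TimeStep = ElmGTS(1)
   Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
      TimeStep = Min(TimeStep, ElmGTS(i))
   End Do

   print *, TimeStep        ! should match MinVal(ElmGTS)
   deallocate(ElmGTS)
end program main
```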

I just found my problem, but thank you all the same. Your code is good.

(2) Code running on GPU
TimeStep = ElmGTS(1)
Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
TimeStep = Min(TimeStep, ElmGTS(i))
End Do
!$acc update self(TimeStep)

In my program, ElmGTS is updated in a loop. Before the loop, I use !$acc enter data copyin(ElmGTS).

Then, inside the loop, it is updated by the do concurrent construct, and there is no !$acc update self(ElmGTS) after each recalculation.

That is, only the device copy of ElmGTS is updated. The host copy is still the initial ElmGTS, which was initialized to 0. So TimeStep = ElmGTS(1) reads 0 on the host, and the minimum is therefore always 0.
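A minimal sketch of this pitfall (assuming ElmGTS has distinct host and device copies, i.e. no CUDA managed memory): only the device copy is updated, so the host read of ElmGTS(1) sees the stale zero unless at least that element is refreshed first:

```fortran
program stale_host
   ! Sketch of the stale-host-copy pitfall described above.
   implicit none
   integer :: i
   integer, parameter :: NumElms = 1024
   real(kind=8) :: TimeStep
   real(kind=8) :: ElmGTS(NumElms)

   ElmGTS = 0.0d0                    ! host copy initialized to zero
!$acc enter data copyin(ElmGTS)

   ! Updates the device copy only; the host copy stays all zeros.
   Do Concurrent(i = 1:NumElms)
      ElmGTS(i) = 1.0d-2 + 1.0d-5 * Real(i, kind=8)
   End Do

   ! Wrong: reads the stale host copy, so TimeStep starts at 0.0.
   TimeStep = ElmGTS(1)

   ! Fix: refresh just the one element before reading it ...
!$acc update self(ElmGTS(1:1))
   TimeStep = ElmGTS(1)

   ! ... then reduce over the rest on the device as before.
   Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
      TimeStep = Min(TimeStep, ElmGTS(i))
   End Do

!$acc exit data delete(ElmGTS)
   print *, TimeStep
end program stale_host
```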

I found another problem in this code.

!$acc update self(ElmGTS)
TimeStep = MinVal(ElmGTS)
!$acc update device(TimeStep)
Write(6,*)TimeStep

When I compile the code with VS on Windows, TimeStep is always near 1.0E-02, more or less, and the values in ElmGTS are also nearly 1.0E-02. The program succeeds. I think this shows that the algorithm in the code is right. Do you agree?

However, when I compile the code with nvfortran on Ubuntu, TimeStep becomes smaller and smaller, down to 1.0E-08, and the program fails: the step is so small that CurTime = CurTime + TimeStep is almost unchanged, so CurTime can never reach EndTime.

I think the nvfortran build is not finding the minimum value of the ElmGTS array. Is that right?

Do you know why this happens and how to deal with it?

Do you know why this happens and how to deal with it?

Without a reproducing example, I can only offer guesses and general advice.

What exactly are you comparing? The exact same source on both platforms, or the CPU version against the GPU-enabled version?

If it’s the same source with both running on the CPU, then it could be a difference in optimization. Your code may be numerically sensitive and thus need strict compliance with IEEE 754. In that case, try adding the flag “-Kieee”. This disables some optimizations that may reorder operations and give slightly different results.

You can also consider using double precision if your code currently uses single precision; this will improve the accuracy.

If it’s the GPU version, then it could be any number of things.

Parallelization reorders operations, which in turn affects accuracy, since the rounding error can differ. There’s not much that can be done here, especially with reductions.

There could be a race condition in the code which is affecting the results.

It could be a data synchronization issue between the device and the host.

I have two versions of the code. Both calculate the minimum value of ElmGTS and assign it to TimeStep. I want to find out which one is valid and better.

--------------------------------------(1) Code
    !$acc update self(ElmGTS)
    TimeStep = MinVal(ElmGTS)
    !$acc update device(TimeStep)
--------------------------------------(2) Code
    !$acc update self(ElmGTS(1))
    TimeStep = ElmGTS(1)
    !$acc update device(TimeStep)
    Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
       TimeStep = Min(TimeStep, ElmGTS(i))
    End Do
    !$acc update self(TimeStep)

TimeStep and ElmGTS are both Real(kind=8), that is, double precision.
Following your advice, I used the -Kieee flag, and it helps to some degree.
For code (1), when this flag is used, TimeStep stays at about 1.0E-02. Without it, TimeStep becomes smaller and smaller.

For code (2), whether or not the flag is used, TimeStep becomes smaller and smaller.

I want to use code (2), because only the single value ElmGTS(1) is transferred from device to host.
In code (1), the whole array ElmGTS is transferred. In my program, ElmGTS has three million elements and is Real(kind=8), so I want to avoid transferring this array as much as possible.
Could you give me some advice?

As for why you’re seeing TimeStep become smaller and smaller: I would really need to see the entire code and be able to reproduce the problem. There’s too much missing information here to know what’s wrong.

I will say that you can simplify code #2, since the first element of ElmGTS does not need to be copied back. Just set TimeStep to a HUGE value and iterate from the first element.

TimeStep = HUGE(0.0)
Do Concurrent(i = 1:NumElms) Reduce(Min:TimeStep)
   TimeStep = Min(TimeStep, ElmGTS(i))
End Do
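One small detail worth noting: since TimeStep is declared Real(kind=8) elsewhere in this thread, a kind-matched seed is slightly cleaner, because HUGE(0.0) returns the largest single-precision value. A sketch of the kind-matched variant:

```fortran
! Sketch: seed the reduction with a HUGE of the same kind as TimeStep.
TimeStep = HUGE(TimeStep)
Do Concurrent(i = 1:NumElms) Reduce(Min:TimeStep)
   TimeStep = Min(TimeStep, ElmGTS(i))
End Do
```

Either seed works here, since any HUGE value is far above a physical timestep; HUGE(TimeStep) just keeps the kinds consistent.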

The whole code is large; it includes about twenty *.F90 files. Following your advice above, I recompiled the code and found a strange problem.

TimeStep = HUGE(0.0)
Do Concurrent(i = 1:NumElms) Reduce(Min:TimeStep)
   TimeStep = Min(TimeStep, ElmGTS(i))
End Do

Sometimes when I start the program, TimeStep stays near 1.0E-02, but other times it becomes smaller and smaller. That is, the program is not stable.
Do you know why? What are the possible reasons?
Could it be that some variables are not initialized?

TimeStep = ElmGTS(1)
Do Concurrent(i = 2:NumElms) Reduce(Min:TimeStep)
TimeStep = Min(TimeStep, ElmGTS(i))
End Do
I think the code above might be equivalent to this:
Do Concurrent(i = 1:NumElms) Reduce(Min:TimeStep)
TimeStep = ElmGTS(i)
End Do
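For what it’s worth, the two forms above are not quite equivalent under the Fortran 2023 REDUCE locality rules: the incoming value of the reduction variable still participates in the final combination, so the direct-assignment form yields min(incoming TimeStep, minimum of ElmGTS). A small sketch of the difference (variable names assumed, reusing those from the thread):

```fortran
program reduce_forms
   ! Sketch: the two Reduce(Min:...) forms differ when the incoming
   ! value of the reduction variable is smaller than every array
   ! element (e.g. a stale 0.0 from an un-refreshed host copy).
   implicit none
   integer :: i
   integer, parameter :: NumElms = 8
   real(kind=8) :: ElmGTS(NumElms), tA, tB

   ElmGTS = [(1.0d-2 + 1.0d-3 * Real(i, kind=8), i = 1, NumElms)]

   ! Form A: seed from the first element, reduce over the rest.
   tA = ElmGTS(1)
   Do Concurrent(i = 2:NumElms) Reduce(Min:tA)
      tA = Min(tA, ElmGTS(i))
   End Do
   print *, tA        ! the true minimum of ElmGTS

   ! Form B: direct assignment.  The prior value of tB is still
   ! combined with the per-iteration values, so a stale tB = 0.0
   ! makes the result 0.0 rather than the array minimum.
   tB = 0.0d0
   Do Concurrent(i = 1:NumElms) Reduce(Min:tB)
      tB = ElmGTS(i)
   End Do
   print *, tB
end program reduce_forms
```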

No, I have no way of knowing without more details. Note that twenty files isn’t that big, so if you can share them, that would be great. If you can’t share them publicly, please send me a direct message by following the link to my profile and clicking the “message” button. You can then provide details on how to obtain your code.

Could it be that some variables are not initialized?

That’s one possibility. You can run Valgrind on the CPU version to see if it finds any uninitialized memory reads (UMRs).