Does nvfortran -stdpar=gpu support two GPUs with NVLink?

Several months ago, I rewrote a 2D numerical model entirely in “do concurrent” form and compiled it with nvfortran 23.7 and the “-stdpar=gpu” option. The model ran faster then than it does now compiled with nvfortran 24.7: about 14 s vs. 24 s per time step.
I checked the code and couldn’t find any differences, and repeated runs with 24.7 never got any faster.
Since nvfortran 23.7 was already uninstalled, it feels almost like a dream that it was ever so fast.
The code runs on the same platform with 2 NVIDIA A800 GPUs connected by NVLink.
I am curious whether nvfortran 23.7 is faster than 24.7, and whether the compiler can use both GPUs automatically over NVLink?
Thanks!

Not in general, but it’s possible there’s a regression with your code. If you can provide a reproducing example, I can take a look. If I can confirm a regression due to the compiler, I’ll submit a report to engineering.

whether the compiler can use both GPUs automatically over NVLink?

Multi-GPU programming needs a higher-level parallel model, with MPI being the best choice.

Thanks for the information. I’ll try it elsewhere to confirm the problem.

CodePR620s.zip (19.7 MB)

Hi, Mat! I tried my 2D ocean model code on a V100 GPU platform with nvfortran 23.7. It runs faster there than on the A800 GPU with nvfortran 24.7, even though the Xeon(R) Gold 5118 CPU on the V100 side is slower than the Xeon(R) Gold 6348 CPU on the A800 side. I think the compiler is making the faster hardware run slower. Still, my earlier record of the A800 with nvfortran 23.7 remains faster than the V100 with nvfortran 23.7.
So I have uploaded my code. After “make” and “./A2D” in a Linux shell, the program displays the time consumed every 100 steps and saves it to TCSP_01.TXT. The folder [saveSpeedResult] contains 2 saved records.
Thanks for taking a look!
Besides, I have another 3D ocean model that could be used to test this problem. Currently, though, that model only works under nvfortran 24.7, because nvfortran 23.7 does not support procedure calls inside do concurrent constructs. I will edit that 3D model later and run an additional performance test.

Thanks chenbr!

I was able to reproduce the issue and tracked it down to the compiler implicitly parallelizing several inner reduction loops. While in most cases this would improve performance, it hurts in this case. I’ve filed a problem report, TPR #37206, and sent it to engineering for investigation.

As a workaround, you can use OpenACC for these loops with the “gang vector” clauses so the compiler doesn’t do the implicit parallelization. There are 6 loops in total (2 in ExtAAM.f, 3 in ExtEl.f, and 1 in ReTS.f), which you can find by looking at the compiler feedback messages (i.e. add “-Minfo=acc” to your flag set) where it says “implicit reduction”.

Here’s an example of my changes:

#ifndef USE_OPENACC
       Do CONCURRENT (M=1:NN) LOCAL(SumRIN,SumUXRIN,SumUYRIN,
     * SumVXRIN,SumVYRIN,SumAAMRIN,N,I)
#else
!$acc parallel loop private(SumRIN,SumUXRIN,SumUYRIN,
!$acc&  SumVXRIN,SumVYRIN,SumAAMRIN,N,I) gang vector
      Do M=1,NN
#endif

Then add “-Mpreprocess -DUSE_OPENACC -acc” to the makefile to enable the OpenACC code.

-Mat

CodePR621s.zip (19.7 MB)
Thanks a lot for your answer, Mat! The “-Minfo=acc” flag is very useful in improving my models.

In the code “Do N=1,NT_N(M)” where the “implicit reduction” happens, NT_N(M) is always <= 8. Maybe the compiler treats it as “do concurrent (N=1:NT_N(M)) reduce(+:SumMFlux, Sum0)”, but a trip count of 8 is too small for a parallel reduction to be any faster.
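For reference, the pattern looks roughly like this (a simplified sketch, not my actual model code; Flux, W, NT, A, and NN are placeholder names):

      Do CONCURRENT (M=1:NN) LOCAL(SumMFlux,Sum0,N)
        SumMFlux = 0.0
        Sum0 = 0.0
        Do N = 1, NT_N(M)          ! inner trip count is always <= 8
          SumMFlux = SumMFlux + Flux(NT(N,M))
          Sum0 = Sum0 + W(NT(N,M))
        End Do
        A(M) = SumMFlux/Sum0
      End Do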

I have uploaded an advanced version, 621s. In this version, I use the “-stdpar=gpu -acc=gpu -gpu=nomanaged” flags and take care of each data transfer with acc directives. After adding the “-Minfo=acc” flag, I found a lot of data transfer messages like “Generating implicit copyout(q_qbc(1:numqbc)) [if not already present]”. Does that mean that if the data is present there will be no copyout (or copyin)? Does “present” mean “on the device” or “up to date”? It seems that the “-gpu=nomanaged” flag does not block all implicit data transfers. So what does the “-gpu=nomanaged” flag actually mean? I want data transfers to occur only under acc directives.

Besides, I am curious about how “-gpu=managed” works. Does it transfer data for each loop? Or does it intelligently transfer data only when necessary? Or can code that writes to files read data directly from the device?

Thanks again!

I upgraded “ExtAAM.f” in version 621s. The model is now nearly 2x faster, but “-Minfo=acc” tells me the “implicit reduction” still exists, as shown below.
CodePR621_ExtAAM_upgrade.zip (3.1 KB)

     Generating NVIDIA GPU code
     27, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     31, !$acc loop seq
         Generating implicit reduction(+:sumvrjn,sumurjn,sumrjn)

Anyway, it’s much faster now. It’s great! Thanks!

Correct, though it’s on a sequential loop. The compiler’s typical strategy is to parallelize the inner loop across the thread (vector) dimension. Here, though, the inner loop is very small (8), and a loop needs a trip count of at least 32 or so to make parallelization worthwhile. But since the bounds are not known at compile time, the compiler goes ahead with the parallelization, which is why you need to give it some help.
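For completeness, here’s a sketch of giving it that help explicitly (placeholder array names, following the same pattern as the earlier example): parallelize the outer loop across gang/vector and keep the short inner reduction sequential within each thread.

!$acc parallel loop gang vector private(SumRJN,N)
      Do M = 1, NN
        SumRJN = 0.0
!$acc loop seq
        Do N = 1, NT_N(M)          ! trip count <= 8: keep it seq
          SumRJN = SumRJN + RJN(N,M)
        End Do
        AvgR(M) = SumRJN/Real(NT_N(M))
      End Do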

Sorry, I missed your questions from the preceding post.

Does that mean if present then there will be no copyout(or copyin)?

Correct. Unless you have an explicit data clause on the parallel loop, or it’s within a structured data region, the compiler can’t be sure that the data will be on the device, so it still adds the implicit data copy. However, “present_or_copy” semantics apply, i.e. if the data is already present on the device then no copy is done.
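For example, with a structured data region (a sketch; A, B, and N are placeholders), the compute region finds the arrays present and the implicit copies are skipped:

!$acc data copyin(A) copyout(B)
!$acc parallel loop
      Do I = 1, N
        B(I) = 2.0*A(I)            ! A and B are present: no copy
      End Do
!$acc end data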

Does “present” mean “on the device” or “up to date”?

The compiler’s runtime keeps a present table containing the mapping between host and device addresses. When entering a compute region, it looks up each variable in this table. If the variable is “present”, it passes the device address to the compute kernel and does not do the copy. If no entry is found, it performs the copy, since the data needs to be on the device for the compute.
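You can also query this table yourself through the OpenACC runtime API, for example with acc_is_present (a sketch; A is a placeholder):

      Program CheckPresent
      Use openacc
      Real :: A(1000)
!$acc enter data copyin(A)
      If (acc_is_present(A)) Print *, 'A is on the device'
!$acc exit data delete(A)
      End Program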

It seems that the “-gpu=nomanaged” flag does not block all implicit data transfers.
So what does the “-gpu=nomanaged” flag actually mean? I want data transfers to occur only under acc directives.

“managed” means that all allocations (heap memory) are placed in a unified memory space that is accessible from both the host and the device; the CUDA driver takes care of the data movement for you.

“nomanaged” means that you need to add explicit data movement via data directives. If you want to get rid of the implicit data copy messages, add the variables to a data clause on the compute region.

For example:

!$acc parallel loop present(array1, array2)

or

!$acc parallel loop default(present)

With the “present” clause, if the host address is not found in the present table, the program aborts at runtime with an error.
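For allocatables with lifetimes that don’t fit a structured region, the enter/exit data directives plus “update” give the same manual control (a sketch; A and N are placeholders):

! Create and fill the device copy:
!$acc enter data copyin(A)
!$acc parallel loop default(present)
      Do I = 1, N
        A(I) = A(I) + 1.0
      End Do
! Copy device -> host, e.g. before writing results to a file:
!$acc update self(A)
! Remove the device copy when done:
!$acc exit data delete(A)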

Besides, I am curious about how “-gpu=managed” works. Does it transfer data for each loop? Or does it intelligently transfer data only when necessary? Or can code that writes to files read data directly from the device?

It uses CUDA Unified Memory. The data movement is handled by the CUDA driver which detects if the data is “dirty” (i.e. has been modified) and if so, copies it at a page level.

When the first kernel runs, the device data needs to be updated, so as the data is accessed on the device it gets copied over. For the next kernel call, provided the data hasn’t been modified on the host in between the kernels, no copy is needed.
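As an illustration (a sketch, assuming compilation with “-stdpar=gpu -gpu=managed”; N is a placeholder size):

      Real, Allocatable :: A(:)
      Allocate (A(N))              ! heap allocation -> managed memory
      A = 0.0                      ! pages dirtied on the host
      Do CONCURRENT (I=1:N)        ! 1st kernel: pages migrate to GPU
        A(I) = A(I) + 1.0
      End Do
      Do CONCURRENT (I=1:N)        ! 2nd kernel: no host change, no copy
        A(I) = A(I)*2.0
      End Do
      Write (*,*) A(1)             ! host access: page migrates back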

Note that since the 2017 article I linked above, CUDA Unified Memory has expanded to also manage all host memory including stack and global, not just heap. So the flag “-gpu=mem:managed” is the older “heap-only” managed memory, while “-gpu=mem:unified” is all memory.

The caveat is that to use fully unified memory, you need a system with Heterogeneous Memory Management (HMM) support and, for performance, preferably one with NVLink, like a Grace Hopper system. You can use HMM on an x86 system over PCIe, but it can be slow.

Thanks for so many valuable clues for me!

The “dirty” check is a great feature. Does file-writing code like “Write (200,*) A(1),B(5)” also detect whether A(1) and B(5) are “dirty”?

The implicit data copy messages don’t matter to me; I only care about whether a data transfer actually happens. If all allocatable arrays are present and not “dirty”, the “!$acc parallel loop default(present)” form should take about the same computational time as “do concurrent” and plain “!$acc parallel loop”, since none of them moves data. “!$acc parallel loop default(present)” differs only in not emitting messages when “nomanaged” is used or the data is not “dirty”; when the data is “dirty”, it also differs from the other 2 forms in not moving it. I prefer the “do concurrent” form because it’s simple, and my 621s version is already close to the fastest form. If I change every “do concurrent” into “!$acc parallel loop default(present)”, performance won’t improve. Am I right?

I have a further wish: a flag like “-default=present” or “-stdpar=default:present”, so that I could easily find my coding mistakes through “not present” runtime errors. Otherwise, with this simple “do concurrent” code, I can only guess where I went wrong and search for the bug the hard way.

Thanks a lot!

By “dirty”, I mean that the data has been modified. So, while I’m not 100% sure, just writing the data out on the host shouldn’t cause the data to be recopied.

If I change every “do concurrent” into “!$acc parallel loop default(present)”, performance won’t improve. Am I right?

Unlikely. More likely it would be about the same. DO CONCURRENT uses the same code generator as the OpenACC “kernels” construct, so if there were differences, they would come from using “parallel” vs “kernels”.
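To make that concrete, the same loop in its three spellings (a sketch; A, B, C, and N are placeholders):

! DO CONCURRENT, compiled with -stdpar=gpu:
      Do CONCURRENT (I=1:N)
        C(I) = A(I) + B(I)
      End Do

! the OpenACC "kernels" form, which shares its code generator:
!$acc kernels
      Do I = 1, N
        C(I) = A(I) + B(I)
      End Do
!$acc end kernels

! the hand-written "parallel" form:
!$acc parallel loop
      Do I = 1, N
        C(I) = A(I) + B(I)
      End Do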

Yes. I mean that when A(1) and B(5) have been modified on the device, their modified values are written to the file. In my 620s version code, the results in the output files are correct.

Many thanks!
