[Fortran][do concurrent] Questions regarding compile options for managing offloading and performance

After solving this issue I went on to test the saxpy.f90 and jacobi.f90 tutorials. I was a bit surprised by the results I got, and I would like to ask for guidance to see whether I could better exploit the compiler and my available GPU.

Working environment: Windows 11 > WSL2 + Ubuntu 20.04 + nvfortran 2023 + CUDA 12.0
Hardware: CPU: Intel Xeon Gold 6226 - GPU: RTX A6000

Results (times in microseconds):

flags                      saxpy seq   saxpy par   jacobi seq   jacobi par
'-O3'                      380         344         68462       501312
'-O3 -stdpar=multicore'    334         9870        76774       400194
'-O3 -stdpar=gpu'          760         7326        74864       884412

  • When I compile for the GPU, should I understand that when I run the binary, the ‘do concurrent’ loops will be automatically offloaded and the rest of the code still runs on the CPU? (Looking at the task manager I did see the GPU working, but I’m not sure which part of the work it was doing.)
  • Is the system_clock subroutine being called by the CPU only? Is it also measuring the data transfer time? How could I isolate these two times?
  • I tried adding the flag ‘-gpu=cc80,cuda12.0’ to the compilation but saw no difference. Is there anything else I could do to better test the performance?

Thanks

Hi hkvzjal,

Keep in mind that these are toy programs and not well suited for measuring performance. Most of the GPU time is spent initializing the device and copying data to/from the device. There’s a trivial amount of work, which won’t offset the overhead costs.

  • Is the system_clock subroutine being called by the CPU only? Is it also measuring the data transfer time? How could I isolate these two times?

Under the hood, DO CONCURRENT is using CUDA Unified Memory (UM), meaning the data movement is handled by the driver. As the data is accessed, the driver pages it between the host and device. Hence the data movement is included in the kernel time.

If the data is already on the device, then it won’t get copied again (assuming it wasn’t updated on the host between calls). Hence, to see the overhead cost, run the “saxpy_concurrent” routine twice. The difference between the two timings will be the device initialization and data movement cost.
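
Something along these lines (a sketch only, reusing the x, y, n, a arguments and the system_clock pattern from the tutorial; t0..t2 are placeholder integer counters):

call system_clock( count=t0 )
call saxpy_concurrent(x, y, n, a)   ! 1st call: device init + data migration + kernel
call system_clock( count=t1 )
call saxpy_concurrent(x, y, n, a)   ! 2nd call: kernel only, data already resident
call system_clock( count=t2 )
! (t1 - t0) - (t2 - t1) approximates the init + data movement overhead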

Alternatively, you can add OpenACC data directives to handle the data movement and disable UM via “-gpu=nomanaged”. Also, adding “use openacc” and “call acc_init(acc_get_device_type())” at the top of the program will move the device initialization out of the timed region.
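
For example, the top of the program could look roughly like this (a sketch; the program name and the placeholder comments are illustrative, only the acc_init call is the new piece):

program saxpy
   use openacc
   implicit none
   ! ... declarations and allocations ...
   ! initialize the device once, outside any timed region
   call acc_init(acc_get_device_type())
   ! ... !$acc enter data / timed calls / !$acc exit data follow here ...
end program saxpy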

  • I tried adding the flag ‘-gpu=cc80,cuda12.0’ to the compilation but saw no difference. Is there anything else I could do to better test the performance?

Since you have a CC80 device running CUDA 12.0, these are already the default options, so adding them explicitly wouldn’t make a difference. Given these are trivial codes, there’s not much you can do other than exclude the initialization and data movement from the timing.

For performance testing it’s better to move to a non-trivial code; you might instead look at either CloverLeaf or POT3D.

Thank you @MatColgrove! Indeed, by mixing the directives with the do concurrent loops I managed (I think) to isolate the parallel region and obtained much more interesting results!

I followed the example in your 1st link and made the following changes:

...
! time the sequential do-loop version
call system_clock( count=c0 )
call saxpy_do(x2, y, n, a)
call system_clock( count=c1 )

! copy the data to the device before the timed region
!$acc enter data copyin(x, y, n, a)
call system_clock( count=c2 )
call saxpy_concurrent(x, y, n, a)
call system_clock( count=c3 )
!$acc exit data delete(x, y, n, a)
cseq = c1 - c0
cpar = c3 - c2

I did the same thing in the Jacobi example, but with the in/out variables passed to the smooth subroutine:

!$acc enter data copyin(aapar, bbpar, w0, w1, w2, n, m, iters)
...
!$acc exit data delete(aapar, bbpar, w0, w1, w2, n, m, iters)

Using the flags ‘-stdpar=gpu -acc=gpu -gpu=cc80,cuda12.0,nomanaged’, I now get:
Jacobi seq: 69253 micro sec
Jacobi par: 7183 micro sec
saxpy seq: 339 micro sec
saxpy par: 569 micro sec

With the saxpy example, if instead of allocating the arrays with n=1e6 I allocate them with n=1e7, the performance gain is much more interesting:
saxpy seq: 5401 micro sec
saxpy par: 831 micro sec