After solving this issue I went on to test the saxpy.f90 and jacobi.f90 tutorials. I was a bit surprised by the results I got, and I would like to ask for guidance on how to make better use of the compiler and my available GPU.
Working environment: Windows 11 > WSL2 + Ubuntu 20.04 + nvfortran 2023 + CUDA 12.0
Hardware: CPU: Intel Xeon Gold 6226, GPU: RTX A6000
Results:
| flags | saxpy seq | saxpy par | jacobi seq | jacobi par |
|---|---|---|---|---|
| '-O3' | 380 | 344 | 68462 | 501312 |
| '-O3 -stdpar=multicore' | 334 | 9870 | 76774 | 400194 |
| '-O3 -stdpar=gpu' | 760 | 7326 | 74864 | 884412 |
- When I compile for the GPU, should I understand that when I run the binary, the "do concurrent" loops are automatically offloaded and the rest of the code still runs on the CPU? (Looking at Task Manager I did see the GPU working, but I'm not sure which part of the work it took.)
- Is the system_clock subroutine called by the CPU only? Does it also measure the data transfer time? How could I separate the compute time from the transfer time?
- I tried including this flag in the compilation "-gpu=cc80,cuda12.0" but saw no difference. Is there anything else I could do to better test the performance?
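To make the timing question concrete, here is a minimal sketch of the kind of measurement I mean. It is not the tutorial code: the program name, the problem size `n`, and the two-pass layout are my own assumptions for illustration. The idea is to time the same "do concurrent" loop twice with system_clock. My guess, which may be wrong, is that the first pass would also include one-time costs (GPU start-up, moving the arrays), while the second pass would be closer to the loop alone.

```fortran
program saxpy_timing
  use, intrinsic :: iso_fortran_env, only: int64, real32, real64
  implicit none
  integer, parameter :: n = 1000000        ! hypothetical problem size
  real(real32), allocatable :: x(:), y(:)
  real(real32) :: a
  integer(int64) :: t0, t1, t2, rate
  integer :: i

  allocate(x(n), y(n))
  x = 1.0_real32
  y = 2.0_real32
  a = 2.0_real32

  ! Clock ticks per second, used to convert counts to seconds
  call system_clock(count_rate=rate)

  ! First pass: may include one-time overheads (assumption, not verified)
  call system_clock(t0)
  do concurrent (i = 1:n)
    y(i) = a*x(i) + y(i)
  end do
  call system_clock(t1)

  ! Second pass: same loop again, timed separately
  do concurrent (i = 1:n)
    y(i) = a*x(i) + y(i)
  end do
  call system_clock(t2)

  print *, 'first pass  [s]:', real(t1 - t0, real64) / real(rate, real64)
  print *, 'second pass [s]:', real(t2 - t1, real64) / real(rate, real64)
  ! Sanity check: y starts at 2, so after two passes y(1) should be 6
  print *, 'y(1) =', y(1)
end program saxpy_timing
```

If the difference between the two passes is not a meaningful way to separate the transfer time from the compute time, I would appreciate a pointer to the right approach.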
Thanks