After solving this issue, I went on to test the saxpy.f90 and jacobi.f90 tutorials. I was a bit surprised by the results I got, and I would like to ask for guidance on how to make better use of the compiler and my available GPU.
Working env: Windows 11 > WSL2 + Ubuntu 20.04 + nvfortran 2023 + CUDA 12.0
Hardware: CPU: Intel Xeon Gold 6226 - GPU: RTX A6000

| flags | saxpy seq | saxpy par | jacobi seq | jacobi par |
|---|---|---|---|---|
| `-O3` | 380 | 344 | 68462 | 501312 |
| `-O3 -stdpar=multicore` | 334 | 9870 | 76774 | 400194 |
| `-O3 -stdpar=gpu` | 760 | 7326 | 74864 | 884412 |

- When I compile for the GPU, should I understand that when I run the binary, the ‘do concurrent’ loops are automatically offloaded and the rest of the code still runs on the CPU? (Looking at Task Manager, I did see the GPU working, but I’m not sure which part of the work it took.)
- Is the system_clock subroutine called by the CPU only? Does it also measure the data transfer time? How could I isolate the compute time from the transfer time?
- I tried adding the flag ‘-gpu=cc80,cuda12.0’ to the compilation but saw no difference. Is there anything else I could do to better test the performance?
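
To make the timing question concrete, here is a minimal sketch of the measurement pattern I have in mind. It is not the exact tutorial code: the array size `n`, the variable names, and the idea that a first pass absorbs one-time costs (GPU start-up, initial data movement) while a second pass is closer to pure compute are all my assumptions, and that last one is exactly what I would like confirmed.

```fortran
program time_saxpy
  use, intrinsic :: iso_fortran_env, only: int64, real32, real64
  implicit none
  integer, parameter :: n = 10000000        ! assumed problem size
  real(real32), allocatable :: x(:), y(:)
  real(real32) :: a
  integer(int64) :: c0, c1, c2, rate
  integer :: i

  allocate(x(n), y(n))
  a = 2.0_real32
  x = 1.0_real32
  y = 3.0_real32

  ! Ticks per second of the clock used below
  call system_clock(count_rate=rate)

  ! First pass: assumed to include any one-time start-up and data movement
  call system_clock(c0)
  do concurrent (i = 1:n)
    y(i) = y(i) + a*x(i)
  end do
  call system_clock(c1)

  ! Second pass: same loop again, timed separately for comparison
  do concurrent (i = 1:n)
    y(i) = y(i) + a*x(i)
  end do
  call system_clock(c2)

  print '(a,f12.3,a)', 'first pass:  ', 1.0e3_real64*real(c1 - c0, real64)/real(rate, real64), ' ms'
  print '(a,f12.3,a)', 'second pass: ', 1.0e3_real64*real(c2 - c1, real64)/real(rate, real64), ' ms'

  ! Sanity check: 3 + 2*1 + 2*1 = 7 for every element
  if (any(abs(y - 7.0_real32) > 1.0e-6_real32)) error stop 'unexpected result'
end program time_saxpy
```

If the two passes differ a lot under `-stdpar=gpu` but not under plain `-O3`, I would read the difference as overhead rather than compute, but I am not sure that reading is correct.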