Hello there,
I am developing a program for my research, using GPU parallelization with OpenACC and cuFFT, compiled with NVHPC 22.9. The program works fine and has been validated. However, I am facing some performance issues that I honestly cannot explain.
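In case it helps, this is roughly the pattern I follow for the FFT calls (a simplified sketch, not my actual code; the array names, size, and the single 1D single-precision transform are just illustrative):

program cufft_acc_sketch
  use cufft
  use openacc
  implicit none
  integer, parameter :: n = 256
  complex :: a(n), b(n)
  integer :: plan, ierr

  a = (1.0, 0.0)

  ! Create the plan and attach it to the OpenACC synchronous stream
  ierr = cufftPlan1D(plan, n, CUFFT_C2C, 1)
  ierr = ierr + cufftSetStream(plan, acc_get_cuda_stream(acc_async_sync))

  !$acc data copy(a) copyout(b)
  ! Pass the device pointers of the OpenACC arrays to cuFFT
  !$acc host_data use_device(a, b)
  ierr = ierr + cufftExecC2C(plan, a, b, CUFFT_FORWARD)
  !$acc end host_data
  !$acc end data

  ierr = ierr + cufftDestroy(plan)
  if (ierr /= 0) print *, 'cuFFT error code sum: ', ierr
  print *, 'b(1) = ', b(1)
end program cufft_acc_sketch

I build it the same way as the command below (with -acc=gpu and -cudalib=cufft).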
So, let’s say I run my code on my workstation, which has an NVIDIA Quadro P1000, and it takes N seconds. When I run the same simulation on my laptop, which has an NVIDIA GTX 1660 Ti, it takes approximately N/2 seconds. Then I tried it on my research group’s cluster, which has an NVIDIA Tesla V100 32GB GPU, and it takes 3*N seconds. Finally, I ran the same simulation on my department’s cluster, which has an NVIDIA A100 80GB GPU, and it takes almost N seconds.
So, apart from the fact that I apparently have access to several powerful GPUs but cannot make proper use of them (-.-"), how can I optimize my code for these GPUs (the V100 and the A100)?
I compile the code with these flags:
nvfortran -I /[…]/.fftw/include -L /usr/local/cuda-11.7/lib64 -cpp -O3 -Minfo=accel -g -lcufft -lcufftw -use_fast_math -fast -acc=gpu -gpu=managed -cuda -cudalib=cufft -module /[…]/Modules -c /[…]/Sources/filename.f03 -o /[…]/Objects/filename.o
Thank you