OpenACC and CUFFT performance issues HPC

Hello there,

I am developing a program for my research and I am using GPU parallelization with OpenACC and CUFFT. I am using nvhpc version number 22.9. The program works fine, and has been validated. However, I am facing some performance issues and I honestly cannot understand why.

Say I run my code on my workstation, which has an NVIDIA Quadro P1000, and it takes N seconds. When I run the same simulation on my laptop, which has an NVIDIA GTX 1660 Ti, it takes approximately N/2 seconds. Then I tried running it on my research group’s cluster, which has an NVIDIA Tesla V100 32GB GPU, and it takes 3*N seconds. Finally, I ran the same simulation on my department’s cluster, which has an NVIDIA A100 80GB GPU, and it takes almost N seconds.

So, apart from the fact that I seem to have access to many GPUs but cannot make good use of them (-.-"), how can I optimize my code for use with those GPUs (the V100 and A100)?

I compile the code with these flags:
nvfortran -I /[…]/.fftw/include -L /usr/local/cuda-11.7/lib64 -cpp -O3 -Minfo=accel -g -lcufft -lcufftw -use_fast_math -fast -acc=gpu -gpu=managed -cuda -cudalib=cufft -module /[…]/Modules -c /[…]/Sources/filename.f03 -o /[…]/Objects/filename.o

Thank you

Try profiling the app. Compare profiling results across the platforms of interest.

Given that performance appears to be all over the place, my initial working hypothesis would be that the workload applied is not in fact identical across platforms. That could be due to an outright bug (e.g. an uninitialized piece of data), a randomized component (leading to a butterfly effect), or a platform-dependent auto-scaling mechanism of some sort.
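
As a rough illustration of the comparison suggested above, here is a minimal Python sketch. It assumes each platform's profiler output has already been reduced by hand to a mapping from kernel name to (number of calls, total seconds). That format, the function name compare_profiles, and the kernel names and numbers in the usage example are all hypothetical placeholders, not measurements from the program in question. A mismatch in call counts would hint that the workload itself differs between platforms, while matching call counts with a large slowdown would point at the kernel or the data movement around it.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical summary format: kernel name -> (number of calls, total seconds).
Profile = Dict[str, Tuple[int, float]]


def compare_profiles(ref: Profile, other: Profile) -> List[dict]:
    """Compare two per-kernel summaries; return rows with the largest slowdown first."""
    rows = []
    for name in sorted(set(ref) | set(other)):
        ref_calls, ref_time = ref.get(name, (0, 0.0))
        oth_calls, oth_time = other.get(name, (0, 0.0))
        # Slowdown is undefined when the reference platform never ran the kernel.
        slowdown: Optional[float] = oth_time / ref_time if ref_time > 0 else None
        rows.append({
            "kernel": name,
            "calls_match": ref_calls == oth_calls,
            "slowdown": slowdown,
        })
    # Rows with a defined slowdown come first, ordered from largest to smallest.
    rows.sort(key=lambda r: (r["slowdown"] is None, -(r["slowdown"] or 0.0)))
    return rows


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not real measurements.
    workstation = {"fft_forward": (100, 2.0), "update_field": (100, 1.0)}
    cluster = {"fft_forward": (100, 2.5), "update_field": (250, 6.0)}
    for row in compare_profiles(workstation, cluster):
        print(row)
```

In this made-up example, update_field would be flagged both for a different call count and for the largest slowdown, which is the kind of row worth investigating first.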