I tested with the NGC Docker image of HPC SDK 20.11 on Ubuntu 20.04.
The results and speed were the same as OpenMP when the -stdpar=multicore option was used.
However, with -stdpar=gpu, the results diverged.
According to the compiler feedback, the DO loop nested inside DO CONCURRENT is also parallelized and offloaded to the GPU. Is there any way to prevent this?
Try adding the flag -acc=noautopar to disable device auto-parallelization. It's unclear whether this is actually the cause of the divergence, which could be due to a number of reasons, but it's worth a try.
Thanks for your advice.
Using -acc=noautopar did not improve the situation.
Since it works fine with -stdpar=multicore, could the cause be the type of GPU (Quadro RTX 8000)?
I’m new to GPGPU, so I don’t know what to fix.
Unfortunately, I have nothing to share.
One of the compilation results is shown below.
The inner loops at lines 909, 937, and 963 are parallelized on the GPU but not on the CPU.
I thought this might be the cause. Is it not?
CPU
852, Generating Multicore code
852, Loop parallelized across CPU threads
926, Generating Multicore code
926, Loop parallelized across CPU threads
GPU
852, Generating Tesla code
852, Loop parallelized across CUDA thread blocks ! blockidx%x
909, Loop parallelized across CUDA threads(32) ! threadidx%x
926, Generating Tesla code
926, Loop parallelized across CUDA thread blocks ! blockidx%x
937, Loop parallelized across CUDA threads(32) ! threadidx%x
963, Loop parallelized across CUDA threads(32) ! threadidx%x
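
For anyone reading the feedback above, here is a minimal sketch of the kind of loop nest it describes: a DO CONCURRENT outer loop with an ordinary sequential DO loop inside it. This is not the poster's code, which was not shared. The program name, array names, sizes, and the file name nest.f90 are all hypothetical. The build lines in the comments use only flags mentioned in this thread plus -Minfo, and the feedback you get may differ from the listing above.

```fortran
! Hypothetical example, not the code discussed in this thread.
! Possible builds (file name nest.f90 is an assumption):
!   nvfortran -stdpar=multicore -Minfo nest.f90
!   nvfortran -stdpar=gpu -Minfo nest.f90
!   nvfortran -stdpar=gpu -acc=noautopar -Minfo nest.f90
program nest
  implicit none
  integer, parameter :: n = 1024, m = 4096   ! arbitrary sizes
  real, allocatable :: a(:,:), rowsum(:)
  integer :: i, j

  allocate(a(m, n), rowsum(n))
  call random_number(a)

  ! Outer loop: iterations are independent, so the compiler may
  ! spread them across CPU threads or CUDA thread blocks.
  do concurrent (i = 1:n)
    rowsum(i) = 0.0
    ! Inner loop: written as a sequential accumulation. If a compiler
    ! also parallelizes it (for example as a reduction across threads),
    ! the additions happen in a different order.
    do j = 1, m
      rowsum(i) = rowsum(i) + a(j, i)
    end do
  end do

  print *, 'checksum =', sum(rowsum)
end program nest
```

If the inner loop is parallelized as a reduction, the additions are carried out in a different order than in the sequential version. Floating-point addition is not associative, so small differences in the last digits between CPU and GPU runs can be expected from that alone. Whether that has anything to do with the divergence reported here is an open question. Comparing the checksum across the three builds is one way to see how much the ordering matters for a given loop.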