Overheads introduced by compiling with -Mcuda?

Hi,

When I compile and link my OpenACC Fortran code with the -Mcuda option, I notice a ~2-second increase in the execution time of every GPU kernel (OpenACC parallel loop region). Why does this happen? Any leads?

thanks,
Naga

Hi Naga,

Can you try profiling your code with and without -Mcuda to see where the extra time is coming from? Is it in the kernel code itself or something else?

By default, PGI’s OpenACC implementation does not use the default CUDA stream, since that can cause extra synchronization. However, when you add “-Mcuda”, the assumption is that you will be linking with CUDA-compiled code, so we have to revert to using the default CUDA stream. While this can cause some slow-down, ~2 seconds per kernel call seems a bit much.
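
If it helps, here is a minimal sketch of the kind of region that is affected (toy code and file name, not your actual source). Building it once as, say, “pgfortran -acc -ta=tesla -Minfo=accel test.f90” and once with -Mcuda added, then running “nvprof --print-gpu-trace ./a.out” on each build, should show which stream the kernel lands on in each case:

program stream_demo
  implicit none
  integer, parameter :: n = 1024*1024
  real, allocatable :: x(:), y(:)
  real :: a
  integer :: i

  allocate(x(n), y(n))
  a = 2.0
  x = 1.0
  y = 0.0

  ! Without -Mcuda, the PGI runtime launches this region on a
  ! non-default CUDA stream; linking with -Mcuda makes the runtime
  ! fall back to the default (synchronizing) stream.
  !$acc parallel loop
  do i = 1, n
     y(i) = a * x(i) + y(i)
  end do

  print *, 'y(1) =', y(1)
  deallocate(x, y)
end program stream_demo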

Let’s see what the profiler shows.

-Mat

Hi Mat,

I forgot to mention in my original post that I am using unified memory, so I compile my code with -ta=tesla:managed.
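
For reference, the data usage pattern is roughly what this toy sketch shows (illustrative names only, not the actual psmhd3 source): allocatable arrays are placed in CUDA managed memory under -ta=tesla:managed, and the host and device accesses to them are what produce the migrations and GPU page fault groups in the nvprof output below.

program managed_demo
  implicit none
  integer, parameter :: n = 4*1024*1024
  real, allocatable :: u(:)
  integer :: i

  ! With -ta=tesla:managed, allocatable data goes into CUDA managed
  ! (unified) memory, so no explicit data directives are used.
  allocate(u(n))

  u = 1.0                 ! first touch on the host

  !$acc parallel loop     ! device access migrates pages to the GPU
  do i = 1, n
     u(i) = 2.0 * u(i)
  end do

  print *, 'u(n) =', u(n) ! host access migrates pages back
  deallocate(u)
end program managed_demo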

Here is an nvprof output comparison with and without -Mcuda:

With -Mcuda, I see an increase of roughly 79K GPU page fault groups (328,856 vs. 249,926).

Nvprof summary without -Mcuda:
==61760== Unified Memory profiling result:
Device “Tesla P100-PCIE-16GB (0)”
Count Avg Size Min Size Max Size Total Size Total Time Name
3278446 54.733KB 4.0000KB 0.9961MB 171.1300GB 18.349935s Host To Device
133103 1.7151MB 4.0000KB 2.0000MB 222.9288GB 18.563061s Device To Host
249926 - - - - 140.42681s Gpu page fault groups
Total CPU Page faults: 61167

With -Mcuda:

==61896== Unified Memory profiling result:
Device “Tesla P100-PCIE-16GB (0)”
Count Avg Size Min Size Max Size Total Size Total Time Name
4215027 42.572KB 4.0000KB 0.9961MB 171.1314GB 19.168439s Host To Device
132956 1.7172MB 4.0000KB 2.0000MB 222.9591GB 18.557719s Device To Host
328856 - - - - 150.40570s Gpu page fault groups
Total CPU Page faults: 61005

Nvprof kernel execution times:

Without -Mcuda:

15.93% 21.2574s 1 21.2574s 21.2574s 21.2574s psmhd3_1059_gpu
14.94% 19.9377s 1 19.9377s 19.9377s 19.9377s psmhd3_1273_gpu
14.08% 18.7919s 1 18.7919s 18.7919s 18.7919s psmhd3_1493_gpu
12.23% 16.3246s 1 16.3246s 16.3246s 16.3246s psmhd3_1446_gpu
4.40% 5.87420s 1 5.87420s 5.87420s 5.87420s psmhd3_961_gpu

With -Mcuda:

16.77% 23.9730s 1 23.9730s 23.9730s 23.9730s psmhd3_1059_gpu
15.17% 21.6840s 1 21.6840s 21.6840s 21.6840s psmhd3_1273_gpu
14.51% 20.7415s 1 20.7415s 20.7415s 20.7415s psmhd3_1493_gpu
12.35% 17.6543s 1 17.6543s 17.6543s 17.6543s psmhd3_1446_gpu
4.49% 6.41849s 1 6.41849s 6.41849s 6.41849s psmhd3_961_gpu


Any insights into why this happens?

thanks,
Naga