I have recently added some device routines in my code using
!$acc routine(name) seq
The code works fine but the compilation time has gone WAY up.
With CUDA 8.0 the compilation time is extremely long.
With CUDA 9.0 the compilation time has gone down significantly, but it is still much longer than before I added the routines.
Why is this the case? Is it because the code has to compile two versions of the routine (CPU and GPU)?
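For reference, a minimal sketch of the pattern in question (the module and routine names here are made up for illustration):

```shell
# Hypothetical reproducer: a module containing a sequential device routine.
cat > dev_routines.f90 <<'EOF'
module dev_routines
contains
  subroutine scale_val(x, a)
!$acc routine seq
    real, intent(inout) :: x
    real, intent(in)    :: a
    x = x * a
  end subroutine scale_val
end module dev_routines
EOF
pgfortran -acc -c dev_routines.f90
```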
It’s hard to say without an example; there could be multiple reasons.
When using just “-ta=tesla” or “-acc”, the compiler actually creates multiple versions of the GPU code, which can increase the compile time. If you know the target device, it may help to use “-ta=tesla:ccXX”, where “ccXX” is the compute capability of your particular device.
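For example, on a Pascal device (compute capability 6.0) the compile line might look like this (source file name is hypothetical):

```shell
# Generate GPU code only for cc60 instead of one version per supported CC.
pgfortran -acc -ta=tesla:cc60 -c mycode.f90
```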
You can try adding the “-time” flag, which will display compilation timing stats. However, only the PGI compiler itself is instrumented, so if the extra time is spent in the back-end CUDA compiler, it won’t show up here.
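A sketch of that invocation (file name hypothetical):

```shell
# Print per-phase timing stats from the PGI front end.
pgfortran -acc -ta=tesla:cc60 -time -c mycode.f90
```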
Another thing to try is using “-v RUN=/usr/bin/time” on the command line. “-v” is the verbose flag where you can see all the steps the compiler driver makes to compile your code. The “RUN” option will prepend all the driver commands with the time utility so you can see which commands are taking the most time.
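Putting those two options together, it would look something like this (file name hypothetical):

```shell
# Verbose driver output, with each driver step wrapped in /usr/bin/time
# so the slow step (e.g. ptxas) stands out.
pgfortran -acc -ta=tesla:cc60 -v RUN=/usr/bin/time -c mycode.f90
```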
I am specifying the CC manually (-ta=tesla:cuda9.0,cc60).
The extra compile time happens at the end.
When I look at “htop”, it is when the compiler is running “pgi/linux86-64/2017/cuda/9.0/bin/ptxas”.
Ok, unfortunately there’s not much I can do about the performance of ptxas itself.
Though maybe it’s the size of the file we’re feeding it? If that’s the case, you might try breaking your code up into separate files so each ptxas invocation works on a smaller input.
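A sketch of what that split build might look like (file names hypothetical):

```shell
# Compile each translation unit separately so ptxas sees smaller
# inputs, then link the objects into the final executable.
pgfortran -acc -ta=tesla:cc60 -c part1.f90
pgfortran -acc -ta=tesla:cc60 -c part2.f90
pgfortran -acc -ta=tesla:cc60 part1.o part2.o -o app
```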