cuModuleLoadData calls taking orders of magnitude longer on Volta than on Pascal


We have our own fatbin-loading program, and after moving execution to AWS P3 instances, which have V100 GPUs, we noticed that cuModuleLoadData calls take orders of magnitude longer than when we ran on our own P100 GPUs. I can’t pinpoint why, or whether it’s a V100 characteristic or something in the AWS infrastructure.
I’ve tried different versions of the driver and runtime, as suggested elsewhere, but it made no difference.

My only remaining lead is the JIT cache, which I’m investigating.
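For anyone wanting to rule the JIT cache in or out, its behavior can be controlled through documented CUDA environment variables (the size and path below are illustrative values, not from this thread):

```shell
# Disable the compute (JIT) cache entirely, forcing recompilation on every run:
export CUDA_CACHE_DISABLE=1

# ...or, alternatively, keep it enabled but enlarge and relocate it, in case
# the default size limit causes evictions (the path is a placeholder):
export CUDA_CACHE_MAXSIZE=1073741824   # 1 GiB
export CUDA_CACHE_PATH=/tmp/cuda-cache
```

If loading is slow with the cache disabled but fast on a second run with it enabled, JIT compilation is the likely culprit.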

I wrote a quick program that simply loads the fatbins and registers all functions, then ran it under nvprof.
On the P100 machine:

No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   73.92%  1.18661s       219  5.4183ms  157.26us  71.897ms  cuModuleLoadData
                   25.00%  401.23ms         1  401.23ms  401.23ms  401.23ms  cudaFree
                    0.67%  10.790ms      9545  1.1300us     346ns  32.084us  cuModuleGetFunction
                    0.25%  3.9880ms         4  997.00us  991.00us  999.41us  cuDeviceTotalMem
                    0.15%  2.3651ms       388  6.0950us     334ns  253.37us  cuDeviceGetAttribute

On a P3 with one V100 (I killed the run because it was taking too long):

No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   99.79%  136.254s       189  720.92ms  288.46us  23.6257s  cuModuleLoadData
                    0.20%  270.59ms         1  270.59ms  270.59ms  270.59ms  cudaFree
                    0.01%  11.750ms      7812  1.5040us     778ns  51.901us  cuModuleGetFunction

The cuModuleLoadData times seem to increase as more calls are made: the first ones are very fast, and then each call gets progressively slower. I also notice that one CPU core is pegged at 100% the whole time.

Did anything change from Pascal to Volta that could cause this? Does anyone have an idea what the culprit might be?

There are too many variables in play here for me to analyze the situation. It would be very useful if you could boil this down to clean experiments where only one variable changes at a time. In particular, take your existing system and replace a P100 with a V100, making no other changes to either hardware or software (I realize this may not be practically feasible).

I anticipate that when you do that, no performance difference outside measurement noise (2%) will be observed. In other words, the present observation likely boils down to the impact of the operating system / hypervisor and differences in host system specifications (given that you seem to have already eliminated changes to the CUDA software stack), each of which probably has multiple contributing factors.

Generally speaking, CUDA API function overhead is dominated by single-thread CPU performance, and secondarily by system memory performance. In the case of module loading, there are additional components: loading the module from some form of mass storage into host memory, and transferring the code from there to the GPU. There should be no performance difference in the latter between P100 and V100 as long as both are connected via a PCIe gen3 x16 interface.
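As a quick sanity check on that last point, the PCIe link the GPU is currently negotiating can be queried with nvidia-smi (this is a generic diagnostic, not something from the original thread):

```shell
# Report the PCIe generation and lane width each GPU is currently using;
# on a healthy gen3 x16 link this should show "3" and "16".
nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.width.current \
           --format=csv
```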

Yeah, replacing just the GPU is not feasible. We only have access to V100s on AWS, and I suspect something is going on there that we can’t see, although the most likely explanation is my own stupidity.
I’m trying to recompile everything to see if it matters. Maybe there’s a mismatch that the runtime is trying to fix when loading sm_60 code or something.
My first guess was filesystem issues due to EBS, but the same thing happened when all files were in a tmpfs inside an NVIDIA container.

Are you relying exclusively on JIT compilation across all architectures, or are you also pre-building SASS through offline compilation? If the latter, make sure to add SASS generation for the sm_70 target to your build; otherwise you will incur JIT compilation overhead generating sm_70 SASS on the fly when the module is loaded, which could explain your observations.
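For example, a build covering both Pascal and Volta might embed fat-binary targets like the following (the -gencode values are standard nvcc usage; the file names are placeholders, not the poster's actual build):

```shell
# Embed SASS for sm_60 (Pascal) and sm_70 (Volta), plus compute_70 PTX
# so the driver can still JIT for architectures newer than Volta.
nvcc -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_70,code=compute_70 \
     -c kernels.cu -o kernels.o

# Verify which SASS targets actually ended up embedded in the binary:
cuobjdump --list-elf kernels.o
```

If cuobjdump lists no sm_70 ELF image, every module load on a V100 pays the JIT cost.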

I think this was it. After a few hours of recompiling TensorFlow, the overhead is gone. It was presumably JIT’ing everything, since our binaries were compiled only for sm_60. I changed the target and loading is fast now.
The most likely explanation was true after all.
Thanks for the help!

No kernels were profiled.
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   58.82%  1.17393s       219  5.3604ms  204.74us  98.446ms  cuModuleLoadData
                   40.38%  805.94ms         1  805.94ms  805.94ms  805.94ms  cudaFree
                    0.74%  14.765ms      9545  1.5460us     699ns  21.797us  cuModuleGetFunction
                    0.04%  722.68us         1  722.68us  722.68us  722.68us  cuDeviceTotalMem
