We have our own fatbin loading program and after moving execution to AWS P3 instances, which has V100 GPUs, we have noticed that cuModuleLoadData calls are taking orders of magnitude more than when we ran on our own P100 GPUs. I can’t seem to pinpoint why, if it’s a V100 feature or something from AWS infrastructure.
I’ve tried different versions of the driver and runtime, as suggested elsewhere.
My only lead left is this JIT cache, which I’m investigating.
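In case it helps anyone following along: the JIT cache can be controlled through a few documented environment variables. This is the kind of thing I’m toggling to rule it in or out (the path and size below are just example values):

```shell
# Disable the compute cache entirely, forcing a JIT compile on every
# module load -- useful to see whether caching is the variable at all.
export CUDA_CACHE_DISABLE=1

# Or instead grow the cache; the default size is modest, so a large
# set of fatbins can thrash it (value is in bytes).
export CUDA_CACHE_MAXSIZE=1073741824

# Relocate the cache (by default it lives under ~/.nv/ComputeCache
# on Linux) -- example path only.
export CUDA_CACHE_PATH=/tmp/ComputeCache
```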
I wrote a quick program that simply loads the fatbins and registers all their functions, then ran it under nvprof.
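The repro is roughly this shape (a sketch, not our actual loader; the fatbin filename and kernel name are made up, and error handling is trimmed):

```c
/* Sketch of a driver-API fatbin loader, matching the shape of the
 * repro described above. Requires linking against libcuda (-lcuda). */
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

static void check(CUresult r, const char *what) {
    if (r != CUDA_SUCCESS) {
        fprintf(stderr, "%s failed: %d\n", what, (int)r);
        exit(1);
    }
}

/* Read an entire file into a malloc'd buffer. */
static void *read_file(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); exit(1); }
    fseek(f, 0, SEEK_END);
    long n = ftell(f);
    fseek(f, 0, SEEK_SET);
    void *buf = malloc((size_t)n);
    if (fread(buf, 1, (size_t)n, f) != (size_t)n) { perror("fread"); exit(1); }
    fclose(f);
    return buf;
}

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    check(cuInit(0), "cuInit");
    check(cuDeviceGet(&dev, 0), "cuDeviceGet");
    check(cuCtxCreate(&ctx, 0, dev), "cuCtxCreate");

    /* "example.fatbin" and "my_kernel" are placeholders. If the image
     * contains no SASS for the current GPU, cuModuleLoadData is where
     * the JIT compilation cost shows up. */
    void *image = read_file("example.fatbin");
    CUmodule mod;
    check(cuModuleLoadData(&mod, image), "cuModuleLoadData");
    CUfunction fn;
    check(cuModuleGetFunction(&fn, mod, "my_kernel"), "cuModuleGetFunction");

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    free(image);
    return 0;
}
```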
On the P100 machine:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 73.92% 1.18661s 219 5.4183ms 157.26us 71.897ms cuModuleLoadData
25.00% 401.23ms 1 401.23ms 401.23ms 401.23ms cudaFree
0.67% 10.790ms 9545 1.1300us 346ns 32.084us cuModuleGetFunction
0.25% 3.9880ms 4 997.00us 991.00us 999.41us cuDeviceTotalMem
0.15% 2.3651ms 388 6.0950us 334ns 253.37us cuDeviceGetAttribute
On a P3 with one V100, which I killed because it was taking too long:
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 99.79% 136.254s 189 720.92ms 288.46us 23.6257s cuModuleLoadData
0.20% 270.59ms 1 270.59ms 270.59ms 270.59ms cudaFree
0.01% 11.750ms 7812 1.5040us 778ns 51.901us cuModuleGetFunction
It seems the cuModuleLoadData times increase as more calls are made: the first ones are very fast, and then it gets progressively slower. I also notice that one CPU core is pegged at 100% the whole time.
Did anything change from Pascal to Volta that is causing this? Does anyone have any idea on what the culprit could be?
Thanks.
There are too many variables in play here for me to analyze the situation. It would be very useful if you could boil this down to clean experiments where only one variable changes at any one time. In particular, use your existing system and replace a P100 with a V100 making no other changes to either hardware or software (I realize that this may not be practically feasible).
I anticipate that when you do that, no performance difference outside measurement noise level (2%) will be observed. In other words, the present observation likely boils down to the impact of the operating system / hypervisor and differences in host system specifications (given that you seem to have already eliminated changes to the CUDA software stack as a cause). Each of these in turn probably has multiple potential contributing factors.
Generally speaking, CUDA API function overhead is largely dominated by single-thread CPU performance, and secondarily by system memory performance. In the case of module loading, we have as additional components the loading of the module from some form of mass storage into host memory and transferring the code from there to the GPU. There should be no performance difference in the latter between P100 and V100 as long as both are connected via a PCI gen3 x16 interface.
Yeah, replacing just the GPU is not feasible. We only have access to the V100 on AWS, and I suspect something is up there that we can’t see, although the most likely explanation for this issue is my own stupidity.
I’m trying to recompile everything to see if it matters. Maybe there’s a mismatch that the runtime is trying to fix when loading sm_60 code or something.
My first guess was filesystem issues due to EBS, but the same thing happened with all files in a tmpfs inside an nvidia container.
Are you relying on JIT compilation exclusively across all architectures or are you also pre-building SASS through offline compilation? If the latter, make sure to add SASS generation for the sm_70 target to your build, otherwise you will be incurring JIT compilation overhead for generating sm_70 SASS on the fly when loading the module. Which could explain your observations.
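For the record, that would mean something like the following in the build (the exact incantation depends on your build system; this is just the bare nvcc form with a placeholder source file):

```shell
# Build SASS for both Pascal (sm_60) and Volta (sm_70), plus PTX for
# compute_70 so future architectures can still JIT-compile at load time.
nvcc -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_70,code=compute_70 \
     -o app app.cu
```

You can verify which SASS targets actually made it into the resulting binary with cuobjdump (e.g. cuobjdump --list-elf).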
I think this was it. After a few hours of recompiling TensorFlow, the overhead is gone. It was presumably JIT’ing everything, since our binaries were compiled only for sm_60. I added the sm_70 target and loading is fast now.
The most likely explanation was true after all.
Thanks for the help!
No kernels were profiled.
Type Time(%) Time Calls Avg Min Max Name
API calls: 58.82% 1.17393s 219 5.3604ms 204.74us 98.446ms cuModuleLoadData
40.38% 805.94ms 1 805.94ms 805.94ms 805.94ms cudaFree
0.74% 14.765ms 9545 1.5460us 699ns 21.797us cuModuleGetFunction
0.04% 722.68us 1 722.68us 722.68us 722.68us cuDeviceTotalMem