I am currently studying how GPU benchmark performance is affected by a CPU-bound benchmark when the two are collocated. A strange observation: one benchmark, NAMD CUDA, actually runs faster in a VM than in native execution under the same interference from the CPU-bound co-runner.
Here are my experiment settings:
Native: Ubuntu 18.04.2 LTS, 16 cores @ 2.00 GHz, 1 NVIDIA Tesla P100 PCIe 16GB GPU; both NAMD and the CPU-bound benchmark launch 16 threads and run simultaneously on the machine.
VM: built on the same machine as above; the hypervisor is KVM, and two VMs are created, both running Ubuntu 18.04.2 LTS:
VM1 (runs NAMD) – 16 vCPUs (pinned one-to-one to physical cores; see the pinning sketch below), 1 NVIDIA Tesla P100 PCIe 16GB GPU (direct passthrough)
VM2 (runs the CPU-bound benchmark) – 16 vCPUs (pinned one-to-one to physical cores)
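For reference, this is roughly how such pinning can be done with libvirt (the domain names VM1/VM2 and the identity mapping below are illustrative, not necessarily my exact configuration; the same thing can also be set in the domain XML via <cputune>):

    # Pin vCPU i of each VM to physical core i (0..15).
    # Both VMs end up pinned to the same 16 physical cores.
    for i in $(seq 0 15); do
        virsh vcpupin VM1 "$i" "$i"
        virsh vcpupin VM2 "$i" "$i"
    done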
The command to run NAMD is ./namd2 +p 16 +devices 0 apoa1/apoa1.namd
The co-running CPU-bound program uses 1600% of CPU (i.e., it saturates all 16 cores) when running alone on either the native machine or the VM.
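For reference, a quick way to confirm that number (the process name below is a placeholder for whatever co-runner is actually used):

    # Sample per-process CPU usage once per second; ~1600% in the %CPU
    # column means all 16 cores are saturated.
    # <cpu_benchmark> is a placeholder for the actual co-runner binary.
    pidstat -u -p "$(pgrep -d, -f <cpu_benchmark>)" 1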
Here are the results (NAMD run time and GPU utilization):
Native:
Execution time: 85.1 seconds
GPU utilization: 21.5%
VM:
Execution time: 69.5 seconds
GPU utilization: 29.0%
Under interference from the CPU-bound program, the native execution of NAMD has a higher execution time and lower GPU utilization than the VM execution, which contradicts the usual expectation that virtualization adds overhead rather than removing it. Do you know what the problem could be, or how I can analyze this observation with further experiments?
I just rechecked: in the VM nvidia-persistenced is running, but on the native machine nvidia-persistenced has failed to start. This is the output when I check its status on the native machine:
Nov 12 13:58:50 Jupiter nvidia-persistenced[2241]: Started (2241)
Nov 12 13:58:50 Jupiter nvidia-persistenced[2241]: Failed to open libnvidia-cfg.so.1: libnvidia-cfg.so.1: cannot open shared object file: No such file or directory
Nov 12 13:58:50 Jupiter nvidia-persistenced[2240]: nvidia-persistenced failed to initialize. Check syslog for more details.
Nov 12 13:58:50 Jupiter nvidia-persistenced[2241]: PID file unlocked.
Nov 12 13:58:50 Jupiter nvidia-persistenced[2241]: PID file closed.
Nov 12 13:58:50 Jupiter nvidia-persistenced[2241]: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Nov 12 13:58:50 Jupiter nvidia-persistenced[2241]: Shutdown (2241)
Nov 12 13:58:50 Jupiter systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=1
Nov 12 13:58:50 Jupiter systemd[1]: nvidia-persistenced.service: Failed with result ‘exit-code’.
Nov 12 13:58:50 Jupiter systemd[1]: Failed to start NVIDIA Persistence Daemon.
When running without X, nvidia-persistenced has to be running; without it, the kernel driver tears down GPU state whenever no client holds the device, so each run pays the reinitialization cost, and malfunction and lowered performance are to be expected. The "Failed to open libnvidia-cfg.so.1" message points to a broken driver installation. Please fix the driver install on the native system, then rerun the benchmark.
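As a rough sketch of the repair and verification steps (the package name below is an assumption; match it to the driver version actually installed on the host):

    # Reinstall the driver packages so libnvidia-cfg.so.1 is present again.
    # nvidia-driver-418 is an example name; use your actual driver version.
    sudo apt-get install --reinstall nvidia-driver-418

    # Restart the persistence daemon and enable it at boot.
    sudo systemctl restart nvidia-persistenced
    sudo systemctl enable nvidia-persistenced

    # Confirm the GPU now reports persistence mode as enabled.
    nvidia-smi -q | grep -i persistence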
I fixed the driver install on the native machine and nvidia-persistenced is now running as it should, BUT I am still seeing the same problem I described in the question. Do you have any other thoughts?
Looking at your test setup, you either run the CPU and GPU tasks on two different VMs or run them in parallel on the same native host. So I would suspect a difference between how CPU time is distributed to the pinned vCPUs and how the normal host CPU scheduler distributes it between the two competing native processes.
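One way to probe this (a sketch; the namd2 process name comes from your command above, while the 30-second window and the rest are assumptions) is to record scheduler behavior in both configurations and compare per-thread run delays and context switches:

    # System-wide scheduler trace while NAMD and the co-runner compete;
    # run this on the native host, then again on the KVM host while
    # both VMs are running, and compare the latency reports.
    sudo perf sched record -a -- sleep 30
    sudo perf sched latency --sort max

    # Per-thread CPU usage and voluntary/involuntary context switches
    # for the native NAMD run (requires the sysstat package).
    pidstat -t -u -w -p "$(pgrep -d, -f namd2)" 1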