GPU benchmark NAMD runs faster on VM than native execution

Hi all,

I am currently studying how GPU benchmark performance is affected by a CPU-bound benchmark when the two are co-located. A strange observation is that one benchmark, NAMD CUDA, actually runs faster in a VM than in native execution under the same interference from the CPU-bound co-runner.

Here are my experiment settings:

Native: Ubuntu 18.04.2 LTS, 16 cores @ 2.00 GHz, 1 NVIDIA Tesla P100 PCIe 16GB GPU; both NAMD and the CPU-bound benchmark launch 16 threads and run simultaneously on the machine.

VM: Built on the same machine as above; the VMM is KVM, and two VMs are created, both running Ubuntu 18.04.2 LTS (a sketch of the vCPU pinning setup follows this list)
       VM1 (runs NAMD) – 16 vCPUs (pinned one-to-one to physical cores), 1 NVIDIA Tesla P100 PCIe 16GB GPU (using direct passthrough)
       VM2 (runs CPU-bound benchmark) – 16 vCPUs (pinned one-to-one to physical cores)
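
For clarity, here is a minimal sketch of how this kind of one-to-one pinning is typically done with virsh; the domain names (vm1, vm2) and the physical core range 0-15 are assumptions for illustration, not taken from the actual setup:

# Pin each vCPU of VM1 to one physical core (domain name "vm1" and cores 0-15 are placeholders)
for i in $(seq 0 15); do
    virsh vcpupin vm1 $i $i
done
# VM2 is pinned the same way here, assumed to share the same 16 physical cores as VM1
for i in $(seq 0 15); do
    virsh vcpupin vm2 $i $i
done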

The command to run NAMD is ./namd2 +p 16 +devices 0 apoa1/apoa1.namd
The co-running CPU-bound program uses 1600% CPU when it runs alone on either the native machine or the VM.
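
For reproducibility, this is roughly how the co-located native run is driven and measured; the CPU-bound binary name (./cpu_bench), its thread flag, and the log file names are placeholders rather than the actual benchmark:

# Sample GPU utilization once per second in the background
nvidia-smi dmon -s u -d 1 > gpu_util.log &
SMI_PID=$!

# Launch the CPU-bound co-runner with 16 threads (binary name and flag are placeholders)
./cpu_bench --threads 16 &
CPU_PID=$!

# Time the NAMD run under interference
/usr/bin/time -v ./namd2 +p 16 +devices 0 apoa1/apoa1.namd > namd.log

# Stop the co-runner and the utilization sampler
kill $CPU_PID $SMI_PID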

Here is the result (run time and GPU utilization of NAMD):
Native: 
          Execution time: 85.1 seconds
          GPU utilization: 21.5%
VM:
          Execution time: 69.5 seconds
          GPU utilization: 29.0%

Under interference from the CPU-bound program, the native execution of NAMD shows a higher execution time and lower GPU utilization than the VM execution, which contradicts what theory would suggest. Do you know what the problem could be, or how I can analyze this observation with further experiments?

Do you have an X server running on the GPU in both cases? If not, is nvidia-persistenced started?

@generix

I don’t have an Xserver running on the GPU in either case.

Also, nvidia-persistenced is off for the GPU in both cases. Do you suggest I start it?

@generix

I just rechecked: in the VM, nvidia-persistenced is started, but on the native machine nvidia-persistenced has failed to start. This is the output when I check the status on the native machine:

sudo systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2019-11-12 13:58:50 EST; 14min ago
Process: 2291 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exited, status=0/SUCCESS)
Process: 2240 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose (code=exited, status=1/FAILURE)

Nov 12 13:58:50 Jupiter nvidia-persistenced[2241]: Started (2241)
Nov 12 13:58:50 Jupiter nvidia-persistenced[2241]: Failed to open libnvidia-cfg.so.1: libnvidia-cfg.so.1: cannot open shared object file: No such file or directory
Nov 12 13:58:50 Jupiter nvidia-persistenced[2240]: nvidia-persistenced failed to initialize. Check syslog for more details.
Nov 12 13:58:50 Jupiter nvidia-persistenced[2241]: PID file unlocked.
Nov 12 13:58:50 Jupiter nvidia-persistenced[2241]: PID file closed.
Nov 12 13:58:50 Jupiter nvidia-persistenced[2241]: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Nov 12 13:58:50 Jupiter nvidia-persistenced[2241]: Shutdown (2241)
Nov 12 13:58:50 Jupiter systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=1
Nov 12 13:58:50 Jupiter systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
Nov 12 13:58:50 Jupiter systemd[1]: Failed to start NVIDIA Persistence Daemon.

When running without X, nvidia-persistenced has to be running, otherwise malfunction and lowered performance are to be expected. Please fix the driver install on the native system, then rerun the benchmark.
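
For reference, once the missing driver libraries are reinstalled, something along these lines gets the daemon running and confirms persistence mode (exact package handling depends on how the driver was installed):

# Enable and start the persistence daemon, then verify it is active
sudo systemctl enable --now nvidia-persistenced
systemctl status nvidia-persistenced

# Alternatively, persistence mode can be enabled directly through nvidia-smi
sudo nvidia-smi -pm 1

# "Persistence-M" in the nvidia-smi header should now read "On"
nvidia-smi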

@generix

I fixed the driver install on the native machine and nvidia-persistenced is running as it should BUT I am still getting the same problem that I stated in the question. Do you have any other thoughts?

Not really. Maybe some interference from SMT?

@generix

SMT is turned off on our machine.
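
One quick way to confirm SMT is disabled on a host is lscpu:

lscpu | grep -i "thread(s) per core"
# With SMT disabled this reports: Thread(s) per core: 1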

The machine has 4 sockets with 20 cores each, and for this experiment we are only using 16 cores from a single socket.

Looking at your test setup, you either run the two (CPU/GPU) tasks in separate VMs or in parallel on the same native host. So I'd suspect a difference between how CPU time is distributed to the vCPUs and how the normal CPU scheduler divides it between the two competing processes.
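
One way to probe that hypothesis would be to record how much CPU time the NAMD threads actually receive while the co-runner is active, both natively and inside VM1, and compare; pidstat from the sysstat package is one option (the sampling interval and log name here are arbitrary choices):

# Sample per-thread CPU usage of the namd2 process once per second
# (assumes a single namd2 process; run in both environments and compare per-thread %usr)
pidstat -u -t -p $(pgrep -x namd2) 1 > namd_thread_usage.log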