PCI passthrough KVM for CUDA usage

Hello,

I am trying to pass a Tesla K40m through to a virtual machine (qemu-kvm hypervisor) via vfio.
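For reference, the unbind/rebind dance I'm doing boils down to roughly this (a sketch only; the 10de:1023 vendor:device ID is my assumption for the K40m and should be verified with lspci -nn, and the sysfs root is a parameter so the steps can be dry-run against a fake tree):

```shell
#!/bin/sh
# Sketch of handing a PCI device over to vfio-pci via sysfs.
# On a real host this must run as root with sysfs = /sys.
bind_vfio() {
    dev="$1"      # PCI address, e.g. 0000:02:00.0
    vendor="$2"   # vendor ID, e.g. 10de (NVIDIA)
    device="$3"   # device ID, e.g. 1023 (assumed K40m; check lspci -nn)
    sysfs="${4:-/sys}"

    # 1. Detach the device from whatever host driver currently owns it.
    if [ -e "$sysfs/bus/pci/devices/$dev/driver/unbind" ]; then
        printf '%s' "$dev" > "$sysfs/bus/pci/devices/$dev/driver/unbind"
    fi
    # 2. Tell vfio-pci to claim devices with this vendor:device ID.
    printf '%s %s' "$vendor" "$device" > "$sysfs/bus/pci/drivers/vfio-pci/new_id"
}

# On the real host (as root):
# bind_vfio 0000:02:00.0 10de 1023
```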

I downloaded the driver and the CUDA libraries, and compiled all the sample programs successfully. However, when I run them they start but never finish =(. For example, here is the run log of deviceQuery:

deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K40m"
  // INFO ABOUT IT
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = Tesla K40m
Result = PASS

And then it just hangs; the only option is Ctrl+C. Moreover, I installed everything on the host too, and there it finished successfully without any problems. Any help will be appreciated.

dmesg on the VM says only:
[ 1475.225692] nvidia 0000:00:08.0: irq 51 for MSI/MSI-X
dmesg on the host:
kernel: [ 2897.503162] vfio-pci 0000:02:00.0: irq 324 for MSI/MSI-X

Moreover, any access to the PCI device takes far too long. For example, I ran nvidia-smi both in the VM and on the host system and traced it via strace. Here is the output from the VM:

+------------------------------------------------------+
| NVIDIA-SMI 346.59     Driver Version: 346.59         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:00:06.0     Off |                    0 |
| N/A   54C    P0    64W / 235W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.67    4.688353      275785        17           open
  1.08    0.051337        3020        17           close
  0.23    0.010722         104       103           ioctl
  0.01    0.000261          22        12           read
  0.00    0.000235           9        26           mmap
  0.00    0.000177          15        12           write
  0.00    0.000127          16         8           munmap
  0.00    0.000107          11        10           mprotect
  0.00    0.000094          19         5         1 stat
  0.00    0.000070           5        15           fstat
  0.00    0.000055           8         7         7 access
  0.00    0.000030          30         1           execve
  0.00    0.000018           5         4           fcntl
  0.00    0.000015           8         2         1 futex
  0.00    0.000013           4         3           brk
  0.00    0.000007           4         2           rt_sigaction
  0.00    0.000006           6         1           getrlimit
  0.00    0.000005           5         1           lseek
  0.00    0.000004           4         1           set_robust_list
  0.00    0.000003           3         1           rt_sigprocmask
  0.00    0.000003           3         1           arch_prctl
  0.00    0.000003           3         1           set_tid_address
------ ----------- ----------- --------- --------- ----------------
100.00    4.751645                   250         9 total

Here is the output when I run nvidia-smi from the host (I detached the GPU from the VM beforehand):

+------------------------------------------------------+
| NVIDIA-SMI 346.59     Driver Version: 346.59         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:02:00.0     Off |                    0 |
| N/A   48C    P0    64W / 235W |     55MiB / 11519MiB |     60%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 82.25    0.571723       33631        17           open
 15.70    0.109104        6418        17           close
  1.76    0.012264         119       103           ioctl
  0.10    0.000664          44        15           read
  0.05    0.000370          14        26           mmap
  0.02    0.000155          16        10           mprotect
  0.02    0.000152          22         7         7 access
  0.02    0.000134           9        15           fstat
  0.01    0.000100          13         8           munmap
  0.01    0.000078          26         3           brk
  0.01    0.000070           6        12           write
  0.01    0.000069          17         4           fcntl
  0.01    0.000062          62         1           execve
  0.00    0.000029           6         5         1 stat
  0.00    0.000021          11         2           rt_sigaction
  0.00    0.000021          11         2         1 futex
  0.00    0.000010          10         1           rt_sigprocmask
  0.00    0.000010          10         1           getrlimit
  0.00    0.000010          10         1           arch_prctl
  0.00    0.000010          10         1           set_tid_address
  0.00    0.000009           9         1           set_robust_list
  0.00    0.000000           0         1           lseek
------ ----------- ----------- --------- --------- ----------------
100.00    0.695065                   253         9 total

As you can see, "open" from the VM takes far too long, and I have no idea why.
Can anybody help me? Sorry for so much text.

hi
I think I have the same problem.
The physical host is CentOS 7 (unbind the dev, all that).

First with a CentOS 7 VM (KVM), and later I also tried with an Ubuntu 14.04 VM.
In the VM, in both cases,
I installed CUDA 7.5, the samples, and the NVIDIA-Linux-x86_64-352.39 driver.

nvidia-smi
Mon Dec 21 14:55:58 2015
+------------------------------------------------------+
| NVIDIA-SMI 352.39     Driver Version: 352.39         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m           On  | 0000:00:05.0     Off |                    0 |
| N/A   26C    P8    22W / 235W |     22MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

The CUDA samples:

1_Utilities/deviceQuery/deviceQuery
1_Utilities/deviceQuery/deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K40m"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 11520 MBytes (12079136768 bytes)
(15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Tesla K40m
Result = PASS

But others, such as
0_Simple/cdpSimplePrint/cdpSimplePrint:
starting Simple Print (CUDA Dynamic Parallelism)
Running on GPU 0 (Tesla K40m)


The CPU launches 2 blocks of 2 threads each. On the device each thread will
launch 2 blocks of 2 threads each. The GPU we will do that recursively
until it reaches max_depth=2

In total 2+8=10 blocks are launched!!! (8 from the GPU)


^C
I have to Ctrl+C.

I have taken some straces:

strace -o aacuda -ff -r -ttt -x -y -s 1024 /usr/local/cuda/samples/0_Simple/cdpSimpleQuicksort/cdpSimpleQuicksort

This spawns two processes; the first one gets stuck in:

 0.000020 futex(0x7ffe5fdcc7d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1450707640, 875071000}, ffffffff) = 0
 0.000027 ioctl(3</dev/nvidiactl>, 0xc020462a, 0x7ffe5fdcc6e0) = 0
 0.000327 clock_gettime(CLOCK_MONOTONIC_RAW, {849, 316384059}) = 0
 0.000109 clock_gettime(CLOCK_MONOTONIC_RAW, {849, 316479220}) = 0

followed by infinite calls to clock_gettime(CLOCK_MONOTONIC_RAW, ...).

In the second process:

 0.000039 read(8<pipe:[17877]>, "\xab", 1) = 1
 0.000039 futex(0x7ffe5fdcc7d0, FUTEX_WAKE_PRIVATE, 1) = 1
 0.000034 clock_gettime(CLOCK_MONOTONIC_RAW, {849, 315902405}) = 0
 0.000041 clock_gettime(CLOCK_MONOTONIC_RAW, {849, 315985134}) = 0
 0.000063 poll([{fd=8<pipe:[17877]>, events=POLLIN}, {fd=10</dev/nvidia0>, events=POLLIN}, {fd=11</dev/nvidia0>, events=POLLIN}, {fd=12</dev/nvidia0>, events=POLLIN}, {fd=13<pipe:[1640]>, events=POLLIN}], 5, 77) = 0 (Timeout)

and the two clock_gettime(CLOCK_MONOTONIC_RAW, ...) calls plus the "poll timeout" repeat infinitely many times.

From both processes it seems there is a pipe to the /dev/nvidia0 device: something gets written to the device but nothing comes back, and it keeps retrying infinitely.

If you or anyone figured this out I would very much appreciate any hints. Note that I can boot/reboot the host machine, and can insert boot parameters in grub if necessary.
grub/boot contains:
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1"

(I think the last allow_unsafe_interrupts option may not be necessary, or advisable, but...)
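For what it's worth, here is a rough sanity check that intel_iommu=on actually took effect after reboot (a sketch only; the exact log strings vary by kernel version, so the grep pattern is an assumption):

```shell
#!/bin/sh
# check_iommu KERNEL_LOG_TEXT
# Succeeds if the given kernel log text contains the usual Intel IOMMU
# enable messages. Normally you'd pass "$(dmesg)" from the host.
check_iommu() {
    printf '%s\n' "$1" | grep -qiE 'DMAR: IOMMU enabled|Intel-IOMMU: enabled'
}

# Typical use on the host:
# check_iommu "$(dmesg)" && echo "IOMMU active" || echo "IOMMU NOT active"
```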

The only thing I would rather not try is a kernel reconfig/recompilation.
For info, the host machine is a Dell PE R730 with one NVIDIA Tesla K40.

best and tia
Mario


Just to let everyone know: we solved this problem. It was one or both of two things:

update the kernel from
3.10.0-229.11.1 to 3.10.0-327.3.1

update qemu-kvm and qemu-kvm-common from

qemu-kvm-1.5.3-86 to qemu-kvm-ev-2.3.0-31
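If anyone wants to check their own host against the versions that worked for us, here is a rough sketch (sort -V does the version comparison; the qemu binary name and its --version output format may differ by distro, so those lines are assumptions):

```shell
#!/bin/sh
# version_ge A B -- succeeds if version string A >= B (GNU sort -V ordering).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# On the host, something like:
# version_ge "$(uname -r)" "3.10.0-327.3.1" && echo "kernel ok"
# qemu_ver=$(qemu-system-x86_64 --version | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1)
# version_ge "$qemu_ver" "2.3.0" && echo "qemu ok"
```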

Hi Mario,
I am also facing the same problem when I try to pass my Tesla K40m through on KVM using vfio-pci. The QEMU version I'm using is 2.2.0 on Ubuntu 14.04 (kernel version 4.2.0).

$ kvm --version
QEMU emulator version 2.2.0 (Debian 1:2.2+dfsg-5expubuntu9.3~cloud0), Copyright © 2003-2008 Fabrice Bellard

I've also noted that deviceQuery succeeds, albeit after a long time (~16 seconds), but every other CUDA sample has the same problem: 100% CPU utilization, lots of clock_gettime calls, and it never completes. I debugged it and have the same findings as yours: the CUDA sample application continuously makes some ioctl call, possibly to detect a change in some state of the card, but it never sees that change and hence keeps waiting there, making the clock_gettime calls to track how much time it has spent in that operation.

I went through the NVIDIA driver code hoping to get more insight into which ioctl the application is making and what it expects to change, but alas, it led me to rm_ioctl(), which is part of the NVIDIA binary driver.

I'm glad to hear that your problem went away after upgrading to QEMU 2.3. Since I'm already using 2.2, I hope the relevant change went in between 2.2 and 2.3, since my kernel version is already up to date.

Since Ubuntu 14.04 does not have a QEMU package newer than 2.2, I'll have to compile it by hand. Let me try, and I'll update with my findings.

Thanks,
Tomar

Mario, I also wanted to check which firmware you are using for your virtual machines: SeaBIOS (plain old BIOS) or OVMF (virtual UEFI firmware)?

Thanks,
Tomar

I can also confirm that upgrading QEMU to version 2.4.1 solved this problem.
I didn't try 2.3, so maybe that also solves it, as Mario reported.
This means the problem exists in QEMU 2.2.0 but not in 2.3.x and later versions.

Thanks,
Tomar