Windows VMs hang NFSoRDMA mounts on CentOS 6.5

Hello. We are stuck on the following problem: four nodes are connected to a storage server via NFS over RDMA. The hardware is:

Intel 2312WPQJR as the nodes

Intel R2312GL4GS as the storage server, with a dual-port Intel InfiniBand controller

Mellanox SwitchX IS5023 InfiniBand switch for the interconnect.

The nodes and the storage server run CentOS 6.5 with the built-in InfiniBand stack (kernel 2.6.32-431.el6.x86_64).

On the storage server we created an array that shows up in the system as /storage/s01 and is exported via NFS. The nodes mount it with:

/bin/mount -t nfs -o rdma,port=20049,rw,hard,timeo=600,retrans=5,async,nfsvers=3,intr 192.168.1.1:/storage/s01 /home/storage/sata/01

mount shows:

192.168.1.1:/storage/s01 on /home/storage/sata/01 type nfs

(rw,rdma,port=20049,hard,timeo=600,retrans=5,nfsvers=3,intr,addr=192.168.1.1)

Then we create a virtual machine with virsh, with a virtio disk bus. Everything is fine until we start Windows on KVM. A Windows guest may run for two hours or for two days, but under heavy load it hangs the mount (i.e. /sata/02 and /sata/03 stay accessible, but any request to 01 hangs the console completely). The only cure is a hardware reset of the node. If we mount without rdma, everything is fine, and all Linux VMs work without problems.
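For reference, the guest disk is defined roughly like this (a sketch: the image path, file name, and target device are illustrative; the point is that the image sits on the NFSoRDMA mount and is attached on the virtio bus):

```xml
<!-- Illustrative libvirt disk definition: the image lives on the
     NFSoRDMA mount and is attached to the guest on the virtio bus -->
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source file='/home/storage/sata/01/vm01.img'/>
  <target dev='vda' bus='virtio'/>
</disk>
```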

NFS tuning has been done; at the time of the problem the logs show:

Mar 20 09:42:22 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 closed (-103)
Mar 20 09:42:42 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 on mlx4_0, memreg 5 slots 32 ird 16
Mar 20 09:42:49 v0004 kernel: ------------[ cut here ]------------
Mar 20 09:42:49 v0004 kernel: WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0x7d/0xb0() (Not tainted)
Mar 20 09:42:49 v0004 kernel: Hardware name: S2600WP
Mar 20 09:42:49 v0004 kernel: Modules linked in: act_police cls_u32 sch_ingress cls_fw sch_sfq sch_htb ebt_arp ebt_ip ebtable_nat ebtables xprtrdma nfs lockd fscache auth_rpcgss nfs_acl sunrpc bridge stp llc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 openvswitch(U) vhost_net macvtap macvlan tun kvm_intel kvm iTCO_wdt iTCO_vendor_support sr_mod cdrom sb_edac edac_core lpc_ich mfd_core igb i2c_algo_bit ptp pps_core sg i2c_i801 i2c_core ioatdma dca mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core ext4 jbd2 mbcache usb_storage sd_mod crc_t10dif ahci isci libsas scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Mar 20 09:42:49 v0004 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-431.5.1.el6.x86_64 #1
Mar 20 09:42:49 v0004 kernel: Call Trace:
Mar 20 09:42:49 v0004 kernel: [] ? warn_slowpath_common+0x87/0xc0
Mar 20 09:42:49 v0004 kernel: [] ? warn_slowpath_null+0x1a/0x20
Mar 20 09:42:49 v0004 kernel: [] ? local_bh_enable_ip+0x7d/0xb0
Mar 20 09:42:49 v0004 kernel: [] ? _spin_unlock_bh+0x1b/0x20
Mar 20 09:42:49 v0004 kernel: [] ? rpc_wake_up_status+0x70/0x80 [sunrpc]
Mar 20 09:42:49 v0004 kernel: [] ? xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
Mar 20 09:42:49 v0004 kernel: [] ? rpcrdma_conn_func+0x9c/0xb0 [xprtrdma]
Mar 20 09:42:49 v0004 kernel: [] ? rpcrdma_qp_async_error_upcall+0x40/0x80 [xprtrdma]
Mar 20 09:42:49 v0004 kernel: [] ? mlx4_ib_qp_event+0x8b/0x100 [mlx4_ib]
Mar 20 09:42:49 v0004 kernel: [] ? mlx4_qp_event+0x74/0xf0 [mlx4_core]
Mar 20 09:42:49 v0004 kernel: [] ? mlx4_eq_int+0x557/0xcb0 [mlx4_core]
Mar 20 09:42:49 v0004 kernel: [] ? rpc_wake_up_task_queue_locked+0x186/0x270 [sunrpc]
Mar 20 09:42:49 v0004 kernel: [] ? mlx4_msi_x_interrupt+0x14/0x20 [mlx4_core]
Mar 20 09:42:49 v0004 kernel: [] ? handle_IRQ_event+0x60/0x170
Mar 20 09:42:49 v0004 kernel: [] ? handle_edge_irq+0xde/0x180
Mar 20 09:42:49 v0004 kernel: [] ? mlx4_cq_completion+0x42/0x90 [mlx4_core]
Mar 20 09:42:49 v0004 kernel: [] ? handle_irq+0x49/0xa0
Mar 20 09:42:49 v0004 kernel: [] ? do_IRQ+0x6c/0xf0
Mar 20 09:42:49 v0004 kernel: [] ? ret_from_intr+0x0/0x11
Mar 20 09:42:49 v0004 kernel: [] ? __do_softirq+0x73/0x1e0
Mar 20 09:42:49 v0004 kernel: [] ? handle_IRQ_event+0x60/0x170
Mar 20 09:42:49 v0004 kernel: [] ? call_softirq+0x1c/0x30
Mar 20 09:42:49 v0004 kernel: [] ? do_softirq+0x65/0xa0
Mar 20 09:42:49 v0004 kernel: [] ? irq_exit+0x85/0x90
Mar 20 09:42:49 v0004 kernel: [] ? do_IRQ+0x75/0xf0
Mar 20 09:42:49 v0004 kernel: [] ? ret_from_intr+0x0/0x11
Mar 20 09:42:49 v0004 kernel: [] ? intel_idle+0xde/0x170
Mar 20 09:42:49 v0004 kernel: [] ? intel_idle+0xc1/0x170
Mar 20 09:42:49 v0004 kernel: [] ? cpuidle_idle_call+0xa7/0x140
Mar 20 09:42:49 v0004 kernel: [] ? cpu_idle+0xb6/0x110
Mar 20 09:42:49 v0004 kernel: [] ? rest_init+0x7a/0x80
Mar 20 09:42:49 v0004 kernel: [] ? start_kernel+0x424/0x430
Mar 20 09:42:49 v0004 kernel: [] ? x86_64_start_reservations+0x125/0x129
Mar 20 09:42:49 v0004 kernel: [] ? x86_64_start_kernel+0x115/0x124
Mar 20 09:42:49 v0004 kernel: ---[ end trace ddc1b92aa1d57ab7 ]---
Mar 20 09:42:49 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 closed (-103)
Mar 20 09:43:19 v0004 kernel: rpcrdma: connection to 192.168.1.1:20049 on mlx4_0, memreg 5 slots 32 ird 16
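Incidentally, the -103 in the rpcrdma messages is a negated Linux errno code; it can be decoded from the shell (python3 is used here only for the lookup):

```shell
# Decode errno 103, the "closed (-103)" in the rpcrdma messages.
# On Linux this is ECONNABORTED ("Software caused connection abort").
python3 -c 'import errno, os; print(errno.errorcode[103], "-", os.strerror(103))'
```

So the client side is reporting that the RDMA connection was aborted, which matches the QP async error upcall in the trace above.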

Nothing shows up on the storage side. The CentOS virt list couldn't help, so this community is the last place to ask.

Hi Nikolay,

At the beginning of your post you wrote that you are using an "Intel Infiniband 2 ports controller", but later in the dump (which I assume was taken from the same client) I see mlx4 mentioned, which is the Mellanox adapter driver. Can you clarify your configuration?

Generally speaking, I can't recall seeing many users doing NFSoRDMA; I've seen more users doing NFS over IPoIB. I suggest you try that first and see if things improve. Other than that, I don't have anything smart to say, but I will forward this on to folks who are more familiar with NFS…
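For what it's worth, an IPoIB mount would look something like the sketch below (the interface name and client address are illustrative, and the commands need root; the export and mount point are the ones from your post):

```shell
# Bring up IPoIB on the first IB port (interface/address illustrative)
modprobe ib_ipoib
ip addr add 192.168.1.11/24 dev ib0
ip link set ib0 up

# Same NFSv3 export, but over TCP/IPoIB instead of the rdma transport
mount -t nfs -o rw,hard,timeo=600,retrans=5,nfsvers=3,intr \
    192.168.1.1:/storage/s01 /home/storage/sata/01
```

If the hangs disappear with this setup, that points the finger at the xprtrdma path rather than at NFS itself.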

Cheers!

Sorry for not answering earlier - I thought the question had been ignored, and I lost the link.

As I've since found out, the model of the Intel IB cards is AXX1FDRIBIOM. As an update: NFSoRDMA hangs not only with Windows VMs but with any VMs, and there is no way to predict it - there may be hundreds of "connection closed (-103)" errors with everything still fine, or just one or two lines followed by a hung connection…