No. Without the nv_peer_mem module loaded, ibv_reg_mr on the CUDA memory fails.
See the attached lspci output for the client and server, and also a dump of the Quadro K600 capabilities.
The output from cudaDeviceCanAccessPeer is 0.
I’m going between two hosts, and I can still write into GPU memory on one end. I just can’t source from GPU memory. It’s the same if I swap the client and server ends.
I’m somewhat stuck.
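For what it's worth, since ibv_reg_mr on CUDA memory depends on nv_peer_mem being loaded, a quick sanity check on each host is to look for the module in /proc/modules. Below is a minimal sketch of that check. The helper names and the sample listing are made up for illustration; on a real host you would pass in the contents of /proc/modules, and the module name may differ with other driver packages.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names from /proc/modules-style text.

    Each line of /proc/modules starts with the module name, followed by
    size, reference count, dependencies, state and load address.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def peer_mem_loaded(proc_modules_text, candidates=("nv_peer_mem",)):
    """True if any candidate peer-memory module appears in the listing."""
    return bool(loaded_modules(proc_modules_text) & set(candidates))


# Hypothetical listing, only to show the expected input format.
sample_modules = (
    "nv_peer_mem 16384 0 - Live 0x0000000000000000\n"
    "ib_core 311296 1 nv_peer_mem, Live 0x0000000000000000\n"
)

print("nv_peer_mem loaded:", peer_mem_loaded(sample_modules))
```

On the hosts themselves, the input would come from something like `open("/proc/modules").read()`.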
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
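In case it helps anyone reading the lspci lines above: my understanding is that ACS redirect bits set on a bridge can reroute PCIe peer-to-peer traffic through the root complex, so the ACSCtl lines are the ones to look at. Here is a rough sketch that pulls out any ACSCtl flags marked `+`. The function names are my own, and it assumes the `Name: Flag+ Flag-` layout shown in the dump.

```python
def parse_flags(line):
    """Split a line such as 'ACSCtl: SrcValid- TransBlk-' into
    (register name, {flag: bool}); '+' means set, '-' means clear."""
    name, _, rest = line.partition(":")
    flags = {}
    for token in rest.replace(",", " ").split():
        if token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return name.strip(), flags


def acs_enabled_flags(lspci_text):
    """Return the set of ACSCtl flags that are set anywhere in the text."""
    enabled = set()
    for line in lspci_text.splitlines():
        name, flags = parse_flags(line)
        if name == "ACSCtl":
            enabled.update(flag for flag, on in flags.items() if on)
    return enabled


# One ACSCtl line copied from the dump above.
acs_sample = (
    "ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- "
    "UpstreamFwd- EgressCtrl- DirectTrans-"
)

print(acs_enabled_flags(acs_sample))  # empty set: no ACS control bits set
```

Run over the lines above, this finds no ACSCtl bits set, at least on the devices these excerpts came from.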
And a dump of the GPU capabilities:
Name = Quadro K600
uuid = 0x441B3084
luid[0] = 0x0
luid[1] = 0x0
luid[2] = 0x0
luid[3] = 0x0
luid[4] = 0x0
luid[5] = 0x0
luid[6] = 0x0
luid[7] = 0x0
luidDeviceNodeMask = 0
totalGlobalMem = 1029963776
sharedMemPerBlock = 49152
regsPerBlock = 65536
warpSize = 32
memPitch = 2147483647
maxThreadsPerBlock = 1024
maxThreadsDim[3] = 1024,1024,64
maxGridSize[3] = 2147483647,65535,65535
clockRate = 875500 KHz
totalConstMem = 65536
major compute capability = 3
minor compute capability = 0
textureAlignment = 512
texturePitchAlignment = 32
deviceOverlap = 1
multiProcessorCount = 1
kernelExecTimeoutEnabled = 0
integrated = 0
canMapHostMemory = 1
computeMode = 0
maxTexture1D = 65536
maxTexture1DMipmap = 16384
maxTexture1DLinear = 134217728
maxTexture2D[2] = 65536,65536
maxTexture2DMipmap[2] = 16384,16384
maxTexture2DLinear[3] = 65000,65000,1048544
maxTexture2DGather[2] = 16384,16384
maxTexture3D[3] = 4096,4096,4096
maxTexture3DAlt[3] = 2048,2048,16384
maxTextureCubemap = 16384
maxTexture1DLayered[2] = 16384,2048
maxTexture2DLayered[3] = 16384,16384,2048
maxTextureCubemapLayered[2] = 16384,2046
maxSurface1D = 65536
maxSurface2D[2] = 65536,32768
maxSurface3D[3] = 65536,32768,2048
maxSurface1DLayered[2] = 65536,2048
maxSurface2DLayered[3] = 65536,32768,2048
maxSurfaceCubemap = 32768
maxSurfaceCubemapLayered[2] = 32768,2046
surfaceAlignment = 512
concurrentKernels = 1
ECCEnabled = 0
pciBusID = 5
pciDeviceID = 0
pciDomainID = 0
tccDriver = 0
asyncEngineCount = 1
unifiedAddressing = 1
memoryClockRate = 891000
memoryBusWidth = 128
l2CacheSize = 262144
maxThreadsPerMultiProcessor = 2048
streamPrioritiesSupported = 0
globalL1CacheSupported = 0
localL1CacheSupported = 1
sharedMemPerMultiprocessor = 49152
regsPerMultiprocessor = 65536
managedMemory = 1
isMultiGpuBoard = 0
multiGpuBoardGroupID = 0
hostNativeAtomicSupported = 0
singleToDoublePrecisionPerfRatio = 24
pageableMemoryAccess = 0
concurrentManagedAccess = 0
computePreemptionSupported = 0
canUseHostPointerForRegisteredMem = 0
cooperativeLaunch = 0
cooperativeMultiDeviceLaunch = 0
sharedMemPerBlockOptin = 49152
pageableMemoryAccessUsesHostPageTables = 0
directManagedMemAccessFromHost = 0