Question: GPUDirect - For host-to-host RDMA using CUDA to work, does the output from cudaDeviceCanAccessPeer need to be 1?
–background–
I’m trying to do RDMA from GPU memory to GPU memory between two hosts.
The HCA is a ConnectX-5 VPI setup with Eth fabric / openibd.
Output of lspci -t: see attachment
The computers are just Dell Optiplex 9020s with their latest BIOS.
The OS is RHEL 7.6 Linux (kernel 3.10.0-957.el7.x86_64),
nv_peer_memory_1.0-8 (compiled from -master),
cuda_10.1.168_418.67_linux,
OFED 4.6-1.0.1.1
The GPUs are Quadro K600s (compute capability 3.0).
The following is a full dump of the GPU capabilities (same GPU on both boxes):
Name = Quadro K600
uuid = 0x441B3084
luid[0] = 0x0
luid[1] = 0x0
luid[2] = 0x0
luid[3] = 0x0
luid[4] = 0x0
luid[5] = 0x0
luid[6] = 0x0
luid[7] = 0x0
luidDeviceNodeMask = 0
totalGlobalMem = 1029963776
sharedMemPerBlock = 49152
regsPerBlock = 65536
warpSize = 32
memPitch = 2147483647
maxThreadsPerBlock = 1024
maxThreadsDim[3] = 1024,1024,64
maxGridSize[3] = 2147483647,65535,65535
clockRate = 875500 KHz
totalConstMem = 65536
major compute capability = 3
minor compute capability = 0
textureAlignment = 512
texturePitchAlignment = 32
deviceOverlap = 1
multiProcessorCount = 1
kernelExecTimeoutEnabled = 0
integrated = 0
canMapHostMemory = 1
computeMode = 0
maxTexture1D = 65536
maxTexture1DMipmap = 16384
maxTexture1DLinear = 134217728
maxTexture2D[2] = 65536,65536
maxTexture2DMipmap[2] = 16384,16384
maxTexture2DLinear[3] = 65000,65000,1048544
maxTexture2DGather[2] = 16384,16384
maxTexture3D[3] = 4096,4096,4096
maxTexture3DAlt[3] = 2048,2048,16384
maxTextureCubemap = 16384
maxTexture1DLayered[2] = 16384,2048
maxTexture2DLayered[3] = 16384,16384,2048
maxTextureCubemapLayered[2] = 16384,2046
maxSurface1D = 65536
maxSurface2D[2] = 65536,32768
maxSurface3D[3] = 65536,32768,2048
maxSurface1DLayered[2] = 65536,2048
maxSurface2DLayered[3] = 65536,32768,2048
maxSurfaceCubemap = 32768
maxSurfaceCubemapLayered[2] = 32768,2046
surfaceAlignment = 512
concurrentKernels = 1
ECCEnabled = 0
pciBusID = 5
pciDeviceID = 0
pciDomainID = 0
tccDriver = 0
asyncEngineCount = 1
unifiedAddressing = 1
memoryClockRate = 891000
memoryBusWidth = 128
l2CacheSize = 262144
maxThreadsPerMultiProcessor = 2048
streamPrioritiesSupported = 0
globalL1CacheSupported = 0
localL1CacheSupported = 1
sharedMemPerMultiprocessor = 49152
regsPerMultiprocessor = 65536
managedMemory = 1
isMultiGpuBoard = 0
multiGpuBoardGroupID = 0
hostNativeAtomicSupported = 0
singleToDoublePrecisionPerfRatio = 24
pageableMemoryAccess = 0
concurrentManagedAccess = 0
computePreemptionSupported = 0
canUseHostPointerForRegisteredMem = 0
cooperativeLaunch = 0
cooperativeMultiDeviceLaunch = 0
sharedMemPerBlockOptin = 49152
pageableMemoryAccessUsesHostPageTables = 0
directManagedMemAccessFromHost = 0
lspciv.txt (36.5 KB)