RDMA GPUDirect / nvidia-peer-memory / CUDA issue

Hi,

I'm having an issue getting RDMA GPU-to-GPU transfers working correctly.

I'm working at the ibverbs/rdma_cm level.

My RDMA transaction starts with the client sending an IBV_WR_SEND request carrying its buffer information, which the server then uses to post an IBV_WR_RDMA_WRITE_WITH_IMM of a much larger buffer back to the client.
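Roughly, the server-side post looks like this once it has the client's address and rkey from that initial send (a minimal sketch with illustrative names, not the exact code):

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Sketch: post the RDMA write-with-immediate back to the client using the
   address/rkey received in the client's IBV_WR_SEND. Parameters are illustrative. */
static int post_write_with_imm(struct ibv_qp *qp, struct ibv_mr *mr,
                               void *local_buf, uint32_t len,
                               uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_buf;   /* host buffer or cudaMalloc'd pointer */
    sge.length = len;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.imm_data   = htonl(0x1234);       /* arbitrary tag for the client */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = remote_rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}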

The GPUs are just Quadros (K600) and the HCAs are ConnectX-5 VPI cards running Ethernet.

I can get a transfer from server host memory to client GPU memory, but I cannot get either a GPU-to-GPU transfer or a transfer from server GPU memory to client host memory to occur.

What I see in response to the IBV_WR_RDMA_WRITE_WITH_IMM is a work-completion failure on the server with a local protection fault. Basically, I get that error any time I try to transfer from the server GPU to the client, but not from server host memory to the client's GPU memory. The layout is the same for the server host memory as for the server GPU memory (the GPU version just uses cudaMalloc).
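For what it's worth, the failure shows up when polling the server's completion queue, roughly like this (a minimal sketch; the local protection fault is IBV_WC_LOC_PROT_ERR):

#include <stdio.h>
#include <infiniband/verbs.h>

/* Sketch: drain one completion and print the error I'm describing.
   The failing write completes with status IBV_WC_LOC_PROT_ERR (4). */
static void check_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    if (ibv_poll_cq(cq, 1, &wc) > 0 && wc.status != IBV_WC_SUCCESS)
        fprintf(stderr, "wr_id %llu failed: %s (status %d)\n",
                (unsigned long long)wc.wr_id,
                ibv_wc_status_str(wc.status), wc.status);
}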

RHEL 7.6, nv_peer_memory_1.0-8, cuda 10.1, OFED 4.6-1.0.1.1

Is there some configuration item I'm missing when sourcing from GPU memory vs. host memory? I'm just using cudaMalloc and the same ibv_reg_mr call for the GPU version, and posix_memalign plus ibv_reg_mr for the host-memory version.
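Concretely, the two paths differ only in the allocator; the ibv_reg_mr call is the same (a minimal sketch of what I described, with illustrative access flags):

#include <stdlib.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Sketch: same registration call for both, only the allocation differs.
   pd is an existing protection domain; access flags are illustrative. */
static struct ibv_mr *reg_host_buf(struct ibv_pd *pd, size_t size)
{
    void *buf = NULL;
    if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), size))
        return NULL;
    return ibv_reg_mr(pd, buf, size,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_REMOTE_READ);
}

static struct ibv_mr *reg_gpu_buf(struct ibv_pd *pd, size_t size)
{
    void *buf = NULL;
    if (cudaMalloc(&buf, size) != cudaSuccess)
        return NULL;
    /* nv_peer_mem must be loaded for this registration to succeed */
    return ibv_reg_mr(pd, buf, size,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_REMOTE_READ);
}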

Will this configuration work GPU-to-GPU? And if not, why would host-to-GPU work?

Also, host->GPU works in either direction.

Any suggestions?

Thanks

It could be a lot of things. What is the output of:

$ sudo lspci -vvv | grep -i acs

Does it work without the nv_peer_mem module loaded?

If it helps, the perftest suite - https://github.com/linux-rdma/perftest.git - has code with CUDA support; I would suggest checking that working code. In addition, the linux-rdma mailing list may be a better place to ask this kind of programming question.

Although that's a useful suggestion (see perftest_resources.c in that package, around L62, function pp_init_gpu), it doesn't seem to correct my problem. I've checked the WR pointers and they appear to be correct. Does GPUDirect just not support an IBV_WR_RDMA_WRITE_WITH_IMM transaction? The perftest only appears to exercise IBV_WR_RDMA_WRITE and IBV_WR_RDMA_READ.

I also question whether I should be using cudaMalloc or the cuMemAlloc that perftest uses. I thought we aren't supposed to mix the runtime and driver APIs?
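For comparison, the driver-API path perftest appears to take looks roughly like this (my sketch of what pp_init_gpu seems to be doing, not its exact code):

#include <stddef.h>
#include <cuda.h>

/* Sketch: allocate a GPU buffer via the driver API; the resulting
   CUdeviceptr is what gets passed to ibv_reg_mr() as (void *)dptr. */
static CUdeviceptr alloc_gpu_buf_driver_api(size_t size)
{
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr dptr = 0;

    if (cuInit(0) != CUDA_SUCCESS ||
        cuDeviceGet(&dev, 0) != CUDA_SUCCESS ||
        cuCtxCreate(&ctx, 0, dev) != CUDA_SUCCESS ||
        cuMemAlloc(&dptr, size) != CUDA_SUCCESS)
        return 0;

    return dptr;
}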

Thanks

With the latest perftest tools, I'm seeing something similar (failed status 4, which is the local protection fault), but from ib_write_bw when I use --use_cuda.

server:

./ib_write_bw -d mlx5_0 -i 1 -F --report_gbits -R --use_cuda

client:

./ib_write_bw -d mlx5_0 -i 1 -F --report_gbits 15.15.15.5 -R --use_cuda

It works without the --use_cuda flag.

The ib_read_bw fails in the same manner.

Any hints?

mlx5: D2701: got completion with error:

00000000 00000000 00000000 00000000

00000000 00000000 00000000 00000000

00000001 00000000 00000000 00000000

00000000 00008914 10000137 000097d2

Completion with error at client

Failed status 11: wr_id 0 syndrom 0x89

scnt=128, ccnt=0

Failed to complete run_iter_bw function successfully

initializing CUDA

There is 1 device supporting CUDA

[pid = 5441, dev = 0] device name = [Quadro K600]

creating CUDA Ctx

making it the current CUDA Ctx

cuMemAlloc() of a 131072 bytes GPU buffer

allocated GPU buffer address at 0000000b00a00000 pointer=0xb00a00000


RDMA_Read BW Test

Dual-port : OFF Device : mlx5_0

Number of qps : 1 Transport type : IB

Connection type : RC Using SRQ : OFF

TX depth : 128

CQ Moderation : 100

Mtu : 4096[B]

Link type : Ethernet

GID index : 2

Outstand reads : 16

rdma_cm QPs : ON

Data ex. method : rdma_cm


local address: LID 0000 QPN 0x0137 PSN 0x876fba

GID: 00:00:00:00:00:00:00:00:00:00:255:255:15:15:15:07

remote address: LID 0000 QPN 0x00b2 PSN 0xacda4

GID: 00:00:00:00:00:00:00:00:00:00:255:255:15:15:15:05


#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]

No… without nv_peer_mem loaded, the ibv_reg_mr of the CUDA memory fails.

See the attachments for the lspci output from the client and server, and also a dump of the Quadro K600's capabilities.

The output from cudaDeviceCanAccessPeer is 0.

I'm going between two hosts, and I can still write into GPU memory on one end… I just can't source from GPU memory. It's the same if I swap the client/server ends.
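For reference, the call in question is just this (a trivial sketch; as I understand it, it only reports peer access between GPUs inside a single host, so with one K600 per host a 0 isn't surprising):

#include <cuda_runtime.h>

/* Sketch: query whether GPU 'dev' can directly access GPU 'peer' memory.
   Both ordinals refer to GPUs in the same host; values are examples. */
static int can_access_peer(int dev, int peer)
{
    int can = 0;

    if (cudaDeviceCanAccessPeer(&can, dev, peer) != cudaSuccess)
        return -1;
    return can;   /* 1 = peer access possible, 0 = not */
}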

I’m somewhat stuck.

UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

ARICap: MFVC- ACS-, Next Function: 1

ARICtl: MFVC- ACS-, Function Group: 0

ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

ARICap: MFVC- ACS-, Next Function: 0

ARICtl: MFVC- ACS-, Function Group: 0

ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

And a dump of the GPU capabilities:

Name = Quadro K600

uuid = 0x441B3084

luid[0] = 0x0

luid[1] = 0x0

luid[2] = 0x0

luid[3] = 0x0

luid[4] = 0x0

luid[5] = 0x0

luid[6] = 0x0

luid[7] = 0x0

luidDeviceNodeMask = 0

totalGlobalMem = 1029963776

sharedMemPerBlock = 49152

regsPerBlock = 65536

warpSize = 32

memPitch = 2147483647

maxThreadsPerBlock = 1024

maxThreadsDim[3] = 1024,1024,64

maxGridSize[3] = 2147483647,65535,65535

clockRate = 875500 KHz

totalConstMem = 65536

major compute capability = 3

minor compute capability = 0

textureAlignment = 512

texturePitchAlignment = 32

deviceOverlap = 1

multiProcessorCount = 1

kernelExecTimeoutEnabled = 0

integrated = 0

canMapHostMemory = 1

computeMode = 0

maxTexture1D = 65536

maxTexture1DMipmap = 16384

maxTexture1DLinear = 134217728

maxTexture2D[2] = 65536,65536

maxTexture2DMipmap[2] = 16384,16384

maxTexture2DLinear = 65000,65000,1048544

maxTexture2DGather[2] = 16384,16384

maxTexture3D[3] = 4096,4096,4096

maxTexture3DAlt[3] = 2048,2048,16384

maxTextureCubemap = 16384

maxTexture1DLayered[2] = 16384,2048

maxTexture2DLayered[3] = 16384,16384,2048

maxTextureCubemapLayered[2] = 16384,2046

maxSurface1D = 65536

maxSurface2D[2] = 65536,32768

maxSurface3D[3] = 65536,32768,2048

maxSurface1DLayered[2] = 65536,2048

maxSurface2DLayered[3] = 65536,32768,2048

maxSurfaceCubemap = 32768

maxSurfaceCubemapLayered[2] = 32768,2046

surfaceAlignment = 512

concurrentKernels = 1

ECCEnabled = 0

pciBusID = 5

pciDeviceID = 0

pciDomainID = 0

tccDriver = 0

asyncEngineCount = 1

unifiedAddressing = 1

memoryClockRate = 891000

memoryBusWidth = 128

l2CacheSize = 262144

maxThreadsPerMultiProcessor = 2048

streamPrioritiesSupported = 0

globalL1CacheSupported = 0

localL1CacheSupported = 1

sharedMemPerMultiprocessor = 49152

regsPerMultiprocessor = 65536

managedMemory = 1

isMultiGpuBoard = 0

multiGpuBoardGroupID = 0

hostNativeAtomicSupported = 0

singleToDoublePrecisionPerfRatio = 24

pageableMemoryAccess = 0

concurrentManagedAccess = 0

computePreemptionSupported = 0

canUseHostPointerForRegisteredMem = 0

cooperativeLaunch = 0

cooperativeMultiDeviceLaunch = 0

sharedMemPerBlockOptin = 49152

pageableMemoryAccessUsesHostPageTables = 0

directManagedMemAccessFromHost = 0


Does it work using hugepages?

This much, at least, turned out to be a bad build. Something was crossed up among my various installs. I reinstalled MOFED and the installed tools worked. My local build of perftest master was messed up; after the MOFED install I reconfigured and rebuilt perftest, and it then started to work as well. That doesn't appear to have solved my own code's problem, but at least I know the hardware is working and the installation is correct now. So that's actually a very useful step.

See attached for some sample code for a GPU-to-GPU transfer between hosts. It's based off "The Geek in the Corner" sample 2, I think… It gives a protection error (4) when doing an RDMA_WRITE from CUDA memory to CUDA memory (host/CUDA-pinned memory works).

It's a bit of a hack, as I do a cudaMemcpy to fill/read the GPU memory with the message string being sent/received, but it exercises the RDMA path.
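The staging is just along these lines (a minimal sketch; buffer names and sizes are illustrative):

#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>

/* Sketch of the hack: stage the message through host memory with cudaMemcpy
   so the RDMA source/destination is genuine device memory. */
static void fill_gpu_msg(void *d_buf, const char *msg)
{
    cudaMemcpy(d_buf, msg, strlen(msg) + 1, cudaMemcpyHostToDevice);
}

static void print_gpu_msg(const void *d_buf)
{
    char host[256];   /* assumes d_buf is at least this large */

    cudaMemcpy(host, d_buf, sizeof(host), cudaMemcpyDeviceToHost);
    host[sizeof(host) - 1] = '\0';
    printf("received: %s\n", host);
}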

I get the same error if I try a loopback (to the same GPU), but I wasn't expecting that to work anyway.

Could I please get someone to try and see if the attached works for them?

Or review it and clue me in on what I’m doing wrong?

I've also tried the memory-allocation method from perftest (not in the attached code), but I get the same results. The perftest tools themselves run OK (ib_write_bw), per the earlier post, though they were giving the same error until I reconfigured/recompiled perftest. Maybe I'm linking something badly?

Same error over EN or IPoIB, using ConnectX-5 VPI cards.

You might need to edit the Makefile to get rid of that last cp to bin.

I have to backpedal on that last answer.

In order for CUDA support to be configured into perftest, it looks like you have to define CUDA_PATH (/usr/local/cuda-xx) and CUDA_H_PATH (/usr/local/cuda-xx/include/cuda.h).

If you don't, perftest will still accept the --use_cuda flag, but it won't actually do anything with it.
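A configure sequence along these lines should pick it up (paths are examples for a cuda-10.1 install; check the perftest README for the exact variables your version expects):

$ export CUDA_PATH=/usr/local/cuda-10.1
$ export CUDA_H_PATH=/usr/local/cuda-10.1/include/cuda.h
$ ./autogen.sh
$ ./configure CUDA_H_PATH=$CUDA_H_PATH
$ make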

I've verified internally that a fresh OS installation plus fresh Mellanox OFED v4.6, CUDA v9.2, and perftest v4.4 recompiled with CUDA support works.

The end result of this whole mess is that the NVIDIA Quadro K600 is not supported for RDMA to/from GPU memory.

I'm not clear whether the list is still being maintained, but if your GPU is not on this list: https://developer.nvidia.com/gpudirectforvideo#GPUs then it's probably not going to work.