What does "Host, Peer" mean as Source for CUDA Memory Copy in Nsight?

sam_hawker · September 21, 2017, 4:01pm

I have a multi-GPU application running on a PC with two host NUMA nodes (4 GPUs on one node and 1 on the other). It does host-to-device memory copies, which I expect to use pinned host memory allocated with VirtualAllocExNuma() and registered with cudaHostRegister(). When I examine the copies in Nsight, the Source appears as “Host, Pinned” on some contexts but as “Host, Peer” on others. What does this mean?

I’m also noticing that, although cudaHostRegister() returns cudaSuccess for both nodes, cudaHostUnregister() returns cudaErrorHostMemoryNotRegistered for one of them.

sam_hawker · September 22, 2017, 10:25am

I think I’ve worked it out.

I think that it reports “Host, Pinned” for the one context per node that actually calls cudaHostRegister() and “Host, Peer” for each other context on the same node.

Similarly, cudaHostUnregister() only succeeds if it is called on the same context that called cudaHostRegister().

Curiously, calling cudaHostRegister() with the same memory address on multiple contexts fails with a cudaErrorHostMemoryAlreadyRegistered error.

This behaviour is slightly annoying because it means that pinned memory is always “owned” by a specific context and ownership can only be transferred by unpinning and re-pinning it. It can’t be shared equally between contexts and managed by a reference count.
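
The reference counting described above could, in principle, be layered on top in application code. The following is a minimal C++ sketch under stated assumptions, not something CUDA provides: `PinnedRegistry` and its `pin`/`unpin` hooks are hypothetical names. In a real application the hooks would presumably select the owning device with cudaSetDevice() and then call cudaHostRegister() (with cudaHostRegisterPortable) or cudaHostUnregister(); here they are injected as callbacks so the sketch compiles without the CUDA toolkit. It also assumes a single calling thread; a mutex would be needed otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <stdexcept>
#include <utility>

// Hypothetical helper (not part of CUDA): shares one host-memory
// registration between several users by reference counting, and
// remembers which device performed the registration.
class PinnedRegistry {
public:
    // Hooks return true on success. In real code they would wrap
    // cudaSetDevice(device) + cudaHostRegister(...) / cudaHostUnregister(...).
    using PinFn = std::function<bool(int device, void* ptr, std::size_t bytes)>;
    using UnpinFn = std::function<bool(int device, void* ptr)>;

    PinnedRegistry(PinFn pin, UnpinFn unpin)
        : pin_(std::move(pin)), unpin_(std::move(unpin)) {}

    // The first acquire registers the memory on `device`, which becomes
    // the owner. Later acquires only increment the count, so the
    // "already registered" error is never triggered.
    void acquire(int device, void* ptr, std::size_t bytes) {
        auto it = entries_.find(ptr);
        if (it != entries_.end()) {
            ++it->second.refs;
            return;
        }
        if (!pin_(device, ptr, bytes)) throw std::runtime_error("pin failed");
        entries_.emplace(ptr, Entry{device, 1});
    }

    // The last release unregisters on the owner device, regardless of
    // which user happens to make the call.
    void release(void* ptr) {
        auto it = entries_.find(ptr);
        if (it == entries_.end()) throw std::logic_error("pointer not registered");
        assert(it->second.refs > 0);
        if (--it->second.refs == 0) {
            const int owner = it->second.owner;
            entries_.erase(it);
            if (!unpin_(owner, ptr)) throw std::runtime_error("unpin failed");
        }
    }

    // Returns the owning device, or -1 if the pointer is not registered.
    int owner(void* ptr) const {
        auto it = entries_.find(ptr);
        return it == entries_.end() ? -1 : it->second.owner;
    }

private:
    struct Entry {
        int owner;  // device whose context performed the registration
        int refs;   // number of outstanding acquire() calls
    };

    PinFn pin_;
    UnpinFn unpin_;
    std::map<void*, Entry> entries_;
};
```

This does not remove the underlying constraint: the owning context still has to stay alive until the last release, because that is where the unregister call is routed.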

cbuchner1 · September 22, 2017, 2:53pm

But from the perspective of the computer system, pinned means pinned. If it’s pinned on the other NUMA node, the data must go through the QPI interconnect, but it’s still pinned, as in “the page cannot be swapped out to the pagefile”.

So it does make some sense for the other context to report “cudaErrorHostMemoryAlreadyRegistered”.

Have you measured significant throughput differences between the “Host, Pinned” and “Host, Peer” variants of the memory transfers?

Christian

njuffa · September 22, 2017, 3:08pm

A GPU context is the equivalent of a host operating system process. Processes own resources, memory in particular. From what you are finding, so do GPU contexts. That seems logical.

sam_hawker · September 22, 2017, 4:23pm

It just seems like it would be better if it was consistent.

If registering a memory address is considered a per-context operation (even though the portable flag makes it pinned for all contexts), then it should be possible to register the same memory address on multiple contexts. If it’s not per-context, then it shouldn’t matter which context unregisters it.

I guess it’s not a big deal though. The worst-case scenario is that you end up having to keep a context alive longer than you’d like, just to keep the memory registered for the other contexts.

njuffa · September 22, 2017, 4:35pm

I am not a security expert, but it seems to me that host processes or GPU contexts simply sharing memory resources could be a gigantic security hole.

Someone with insight into the CUDA driver architecture will have to explain design decisions, I am most certainly not up to that task. When I joined the CUDA team many years ago (at the very start of the project), my one request was “no driver work” :-) But I would occasionally become aware of various driver related issues and architecture decisions. Now I have been retired for more than three years and do not know how unified memory space affects the various aspects of memory management.

Have you checked what the documentation has to say (if anything) on the issue presently before you? If you have feature proposals, you could file a bug report with NVIDIA (prefix synopsis with “RFE:”). That does not mean that your proposal would be taken up, just that someone will take a look at it.