Pinned memory with multiple socket nodes

Hi all,

I have a system with 2 sockets, and each socket has 2 NUMA nodes. Each socket also has 1 GPU, so in total there are 4 NUMA nodes and 2 GPUs (K40m).

My question is: while using pinned memory, I cannot observe any slowdown when I access the 2nd GPU from the other socket. Could this be related to pinned memory?

I run my application on the CPUs of the 2nd socket, which is local to K40m (1). But I also access K40m (0), which is attached to the first socket.
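For context, the transfers in the nvprof output below come from a pattern roughly like this (a minimal sketch, not my exact code; the buffer size matches the copy size shown by nvprof, and the device numbering is an assumption):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 2097152;  // ~2 MB, matching the nvprof copy size
    float *h_buf0, *h_buf1, *d_buf0, *d_buf1;

    // Pinned (page-locked) host allocations: DMA-reachable by either GPU,
    // regardless of which socket's memory the pages actually live in.
    cudaHostAlloc(&h_buf0, bytes, cudaHostAllocDefault);
    cudaHostAlloc(&h_buf1, bytes, cudaHostAllocDefault);

    cudaSetDevice(0);              // K40m (0), attached to the first socket
    cudaMalloc(&d_buf0, bytes);
    cudaSetDevice(1);              // K40m (1), attached to the second socket
    cudaMalloc(&d_buf1, bytes);

    // Device-to-host copies like the ones shown by nvprof.
    cudaSetDevice(0);
    cudaMemcpy(h_buf0, d_buf0, bytes, cudaMemcpyDeviceToHost);
    cudaSetDevice(1);
    cudaMemcpy(h_buf1, d_buf1, bytes, cudaMemcpyDeviceToHost);

    cudaFreeHost(h_buf0);
    cudaFreeHost(h_buf1);
    cudaFree(d_buf0);
    cudaFree(d_buf1);
    return 0;
}
```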

Some lines from nvprof:

Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
6.71694s  203.04us                    -               -         -         -         -  2.0972MB  10.329GB/s   Tesla K40m (1)         1        57  [CUDA memcpy DtoH]
6.71697s  201.57us                    -               -         -         -         -  2.0972MB  10.404GB/s   Tesla K40m (0)         2        21  [CUDA memcpy DtoH]
6.71715s  201.47us                    -               -         -         -         -  2.0972MB  10.409GB/s   Tesla K40m (1)         1        57  [CUDA memcpy DtoH]
6.71717s  201.44us                    -               -         -         -         -  2.0972MB  10.411GB/s   Tesla K40m (0)         2        21  [CUDA memcpy DtoH]
6.71735s  201.44us                    -               -         -         -         -  2.0972MB  10.411GB/s   Tesla K40m (1)         1        57  [CUDA memcpy DtoH]
6.71756s  201.41us                    -               -         -         -         -  2.0972MB  10.413GB/s   Tesla K40m (1)         1        57  [CUDA memcpy DtoH]
6.71776s  201.41us                    -               -         -         -         -  2.0972MB  10.413GB/s   Tesla K40m (1)         1        57  [CUDA memcpy DtoH]

Thanks in advance

By the way, I am using hwloc to pin my host threads to a particular socket.
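The binding I do with hwloc is roughly the following (a sketch, not my exact code; the socket index is an assumption, and `HWLOC_OBJ_PACKAGE` is the hwloc 2.x name for what hwloc 1.x called `HWLOC_OBJ_SOCKET`):

```c
#include <hwloc.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    // Get the second package (socket); index 1 is an example.
    hwloc_obj_t socket = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PACKAGE, 1);
    if (socket) {
        // Bind the calling thread to the cores of that socket.
        hwloc_set_cpubind(topo, socket->cpuset, HWLOC_CPUBIND_THREAD);
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```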

If you have a multithreaded application, and you are accessing each GPU from a particular thread, then it’s possible that the thread arrangement has lined up properly with the logical core distribution so as to locate each thread “close” to its GPU.

The CPU sockets, on an Intel system, are typically connected by QPI, from what I have seen. I don’t recall the exact transfer speed of QPI, but it’s pretty fast and may have gotten faster lately with Haswell. So the specifics of your system may be important here, and you haven’t indicated them. Anyway, if the QPI link is fast enough (faster than ~10 GB/s), the pattern above is possible even with a socket-to-socket transfer of the data.
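You can inspect the GPU/CPU topology directly to see which CPUs are local to each GPU, e.g. on Linux (assuming a reasonably recent driver and the `numactl` package installed):

```shell
# Show the link between the GPUs and the CPU affinity of each GPU
nvidia-smi topo -m

# Show the NUMA nodes, their CPUs, and their memory
numactl --hardware
```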

Thank you for your answer @txbob, it was very helpful.

Yes, I have a multithreaded application, but I’m using hwloc to bind my threads to a socket. All of my threads run on socket 0, and I verify where they are located by checking htop.

My system is not Intel, but you’re right: the inter-socket link is ~20–30 Gbit. As I understand it, that’s why I don’t see any performance slowdown.