Pinned memory with multiple socket nodes

Hi all,

I have a system with 2 sockets, and each socket has 2 NUMA nodes. Each socket also has 1 GPU, so in total there are 4 NUMA nodes and 2 GPUs (K40m).

My question is: while using pinned memory, I cannot observe any slowdown when I access the 2nd GPU from the other socket. Could this be related to pinned memory?

I run my application on the CPUs of the 2nd socket, which is local to K40m (1). But I also access K40m (0), which is attached to the first socket.
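For context, the transfers in the nvprof output below come from a pattern roughly like this (a minimal sketch, not my exact code; the buffer size matches the copy size shown by nvprof, and the device numbering is an assumption):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 2097152;  // ~2 MB, matching the nvprof copy size
    float *h_buf0, *h_buf1, *d_buf0, *d_buf1;

    // Pinned (page-locked) host allocations: DMA-reachable by either GPU,
    // regardless of which socket's memory the pages actually live in.
    cudaHostAlloc(&h_buf0, bytes, cudaHostAllocDefault);
    cudaHostAlloc(&h_buf1, bytes, cudaHostAllocDefault);

    cudaSetDevice(0);              // K40m (0), attached to the first socket
    cudaMalloc(&d_buf0, bytes);
    cudaSetDevice(1);              // K40m (1), attached to the second socket
    cudaMalloc(&d_buf1, bytes);

    // Device-to-host copies like the ones shown by nvprof.
    cudaSetDevice(0);
    cudaMemcpy(h_buf0, d_buf0, bytes, cudaMemcpyDeviceToHost);
    cudaSetDevice(1);
    cudaMemcpy(h_buf1, d_buf1, bytes, cudaMemcpyDeviceToHost);

    cudaFreeHost(h_buf0);
    cudaFreeHost(h_buf1);
    cudaFree(d_buf0);
    cudaFree(d_buf1);
    return 0;
}
```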

Some lines from nvprof:

Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
6.71694s  203.04us                    -               -         -         -         -  2.0972MB  10.329GB/s   Tesla K40m (1)         1        57  [CUDA memcpy DtoH]
6.71697s  201.57us                    -               -         -         -         -  2.0972MB  10.404GB/s   Tesla K40m (0)         2        21  [CUDA memcpy DtoH]
6.71715s  201.47us                    -               -         -         -         -  2.0972MB  10.409GB/s   Tesla K40m (1)         1        57  [CUDA memcpy DtoH]
6.71717s  201.44us                    -               -         -         -         -  2.0972MB  10.411GB/s   Tesla K40m (0)         2        21  [CUDA memcpy DtoH]
6.71735s  201.44us                    -               -         -         -         -  2.0972MB  10.411GB/s   Tesla K40m (1)         1        57  [CUDA memcpy DtoH]
6.71756s  201.41us                    -               -         -         -         -  2.0972MB  10.413GB/s   Tesla K40m (1)         1        57  [CUDA memcpy DtoH]
6.71776s  201.41us                    -               -         -         -         -  2.0972MB  10.413GB/s   Tesla K40m (1)         1        57  [CUDA memcpy DtoH]

Thanks in advance

By the way, I am using hwloc to pin my host threads to a particular socket.
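The binding I do with hwloc is roughly the following (a sketch, not my exact code; the socket index is an assumption, and `HWLOC_OBJ_PACKAGE` is the hwloc 2.x name for what hwloc 1.x called `HWLOC_OBJ_SOCKET`):

```c
#include <hwloc.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    // Get the second package (socket); index 1 is an example.
    hwloc_obj_t socket = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PACKAGE, 1);
    if (socket) {
        // Bind the calling thread to the cores of that socket.
        hwloc_set_cpubind(topo, socket->cpuset, HWLOC_CPUBIND_THREAD);
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```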

If you have a multithreaded application, and you are accessing each GPU from a particular thread, then it’s possible that the thread arrangement has lined up properly with the logical core distribution so as to locate each thread “close” to its GPU.

The CPU sockets, on an Intel system, are typically connected by QPI, from what I have seen. I don’t recall the exact transfer speed of QPI, but it’s pretty fast and may have gotten faster lately with Haswell. So the specifics of your system may be important here, and you haven’t indicated them. Anyway, if the QPI link is fast enough (faster than ~10 GB/s), the pattern above is possible even with a socket-to-socket transfer of the data.
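You can inspect the GPU/CPU topology directly to see which CPUs are local to each GPU, e.g. on Linux (assuming a reasonably recent driver and the `numactl` package installed):

```shell
# Show the link between the GPUs and the CPU affinity of each GPU
nvidia-smi topo -m

# Show the NUMA nodes, their CPUs, and their memory
numactl --hardware
```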

Thank you for your answer @txbob, it was very helpful.

Yes, I have a multithreaded application, but I’m using hwloc to bind my threads to a socket. All of my threads run on socket 0, and I verify where they are located by checking htop.

My system is not Intel, but you’re right: the inter-socket link is ~20–30 Gbit. As I understand it, that’s why I don’t see any performance slowdown.