In a NUMA system, is there a direct way to get this GPU/CPU map information? It looks to me like the only way to get the device ID is to run some CUDA program and enumerate from 0 to n-1. Does the device ID information reside somewhere else? Maybe the proc file system? Or some NVIDIA tool?
I know we can use a latency/bandwidth test to generate the map, but it would be better if I could extract this information directly.
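For what it's worth, here is one possible direct route on Linux (a minimal sketch, assuming your CUDA version reports PCI IDs in cudaDeviceProp, the GPUs sit in PCI domain 0, and the kernel exposes local_cpulist in sysfs):

/* For each CUDA device, look up the CPUs local to its PCI slot in sysfs. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int dev = 0; dev < ndev; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        /* sysfs path built from the PCI bus/device IDs CUDA reports (domain 0 assumed) */
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/0000:%02x:%02x.0/local_cpulist",
                 prop.pciBusID, prop.pciDeviceID);

        char cpulist[128] = "unknown\n";
        FILE *f = fopen(path, "r");
        if (f) {
            if (!fgets(cpulist, sizeof(cpulist), f))
                snprintf(cpulist, sizeof(cpulist), "unknown\n");
            fclose(f);
        }
        printf("device %d (%s): local CPUs %s", dev, prop.name, cpulist);
    }
    return 0;
}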
Does that mean I can use a CUDA API call to tell me the GPU-CPU affinity, and then use something like pthread_setaffinity_np() to bind a thread to the right CPU? If so, that solves the last of my major multi-GPU problems; if not, can I have it added to “the list”?
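For the binding half of that, here is a minimal sketch (assuming Linux; the CPU number is a placeholder that would come from whatever affinity lookup you end up using):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <cuda_runtime.h>

/* Bind the calling host thread to a single CPU before it creates its CUDA context. */
static void bind_this_thread_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}

/* Per-GPU worker thread: bind first, then touch the GPU, so the context,
 * pinned allocations and copies all happen from the bound CPU. */
void *worker(void *arg)
{
    int gpu = *(int *)arg;
    bind_this_thread_to_cpu(0 /* placeholder: the CPU close to this GPU */);
    cudaSetDevice(gpu);
    /* ... cudaHostAlloc / cudaMemcpy work goes here ... */
    return NULL;
}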
How can one retrieve the affinities using the CUDA API? My nvidia-smi outputs null, complaining about mismatched kernel module and driver versions, though it does run. But now is not the time to reinstall a newer driver. In my multi-GPU applications I simply threw the host threads onto different processors, but to no avail performance-wise…
I thought that “you’ll be able to” was referring to something that can already be done; now that I read it again, the phrasing makes sense, and I shouldn’t have asked the question I eventually did :"> . My apologies.
Meaning that if I run the SDK bandwidthTest on a host thread on each of the different cores, there will be a core for which the bandwidth curves are optimal? I’m not running a NUMA system. Furthermore, and quite a bit off topic here, how can one explain the drop in device-to-host bandwidth a Tesla C1060 shows in this picture: Tesla C1060 Device to Host
Just to expand on what seibert said a little, the problems with modern NUMA machines are twofold: PCIe controllers can have a natural CPU affinity (i.e. they are directly connected to one CPU, but one or more additional hops away over a QPI or HT link from the others), and memory is distributed among the different CPUs’ memory controllers. So doing something like allocating pinned memory for CUDA and copying to a GPU can show a lot of variation in effective throughput depending on GPU-CPU-RAM affinity. This is only seen on multi-CPU machines (or possibly on the new AMD Magny-Cours processor, which is effectively NUMA within a single socket).
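To see the effect described above, one can time the same pinned host-to-device copy with the host thread bound to CPUs on different sockets. A rough sketch (assuming Linux sched_setaffinity, and that CPU 0 and CPU 4 sit on different sockets, to be adjusted for your topology):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <cuda_runtime.h>

/* Bind to one CPU, allocate pinned memory there, and time a single HtoD copy. */
static float timed_copy_from_cpu(int cpu, size_t bytes)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* bind before allocating */

    void *host = NULL, *dev = NULL;
    cudaHostAlloc(&host, bytes, cudaHostAllocDefault);  /* pinned pages on the local node */
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    cudaFreeHost(host);
    return ms;
}

int main(void)
{
    const size_t bytes = 256 << 20;   /* 256 MB */
    printf("HtoD with thread on CPU 0: %.2f ms\n", timed_copy_from_cpu(0, bytes));
    printf("HtoD with thread on CPU 4: %.2f ms\n", timed_copy_from_cpu(4, bytes));
    return 0;
}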
To convince yourself that it’s not just a detail, here is a small experiment: I’ve taken Tim’s great “concurrent bandwidth” test, which basically stresses all the links of your machine, so if you have a bad mapping of your CUDA threads/contexts you will get poor performance because of high contention (latency is not important here).
Here is the performance of Tim’s original benchmark on a machine with 3 GPUs and 2 NUMA nodes (8 Nehalem cores in total).
gonnet@hannibal:~/These/Tests/ConcurrentBandwithTest$ ./concBandwidthTest 0 1 2
Device 0 took 1303.871582 ms
Device 1 took 2427.816406 ms
Device 2 took 1739.778564 ms
Average HtoD bandwidth in MB/s: 11223.201904
Device 0 took 2677.287842 ms
Device 1 took 6912.451660 ms
Device 2 took 6304.711426 ms
Average DtoH bandwidth in MB/s: 4331.458069
And it turns out that the performance is not that stable, so sometimes I get bad HtoD bandwidth (~8 GB/s), because the threads may not be bound as we mentioned earlier.
So I modified Tim’s test to do some HtoD bandwidth benchmarking and detect which NUMA node is the closest one. Note that this is also what is done in various libraries (especially in Guochun Shi’s CUDA wrapper). Once the closest NUMA node is found for each device, I bind the thread that holds the context to the CPUs associated with that NUMA node. I used the hwloc library (Portable Hardware Locality (hwloc)) to do that, because there is unfortunately no easy way to bind a thread to a NUMA node, or more simply to distinguish two contexts of the same hyperthreaded core (as those Nehalems are hyperthreaded). But basically it’s just a “portable” pthread_setaffinity (a minimal sketch of that binding is at the end of this post)…
gonnet@hannibal:~/These/Tests/ConcurrentBandwithTest$ ./concBandwidthTest_numa 0 1 2
Device 0 took 1121.960327 ms
Device 1 took 2155.578125 ms
Device 2 took 1382.516846 ms
Average HtoD bandwidth in MB/s: 13302.581055
Device 0 took 3831.385010 ms
Device 1 took 4927.671387 ms
Device 2 took 3831.820068 ms
Average DtoH bandwidth in MB/s: 4639.426758
As you can see, this trivial mechanism gives a significant improvement … and the results are much more stable too. Binding the CUDA threads in such a way also gave significant improvements for less trivial benchmarks. When we have other I/O (MPI or disks), such considerations may become even more important. Conclusion: I can’t wait to get CUDA 3.1! ;)
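In case it helps someone, the hwloc binding itself boils down to something like this (a minimal sketch, assuming hwloc 1.x and linking with -lhwloc; the node index is whatever the bandwidth probing selected):

/* Bind the calling thread (the one that will hold the CUDA context for a
 * given GPU) to all the CPUs of a chosen NUMA node. */
#include <hwloc.h>

static int bind_thread_to_numa_node(int node_index)
{
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* Grab the NUMA node object and bind this thread to its whole cpuset,
     * so we stay on that socket but can still use any of its cores. */
    hwloc_obj_t node = hwloc_get_obj_by_type(topology, HWLOC_OBJ_NODE, node_index);
    int err = -1;
    if (node)
        err = hwloc_set_cpubind(topology, node->cpuset, HWLOC_CPUBIND_THREAD);

    hwloc_topology_destroy(topology);
    return err;
}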