How to relate device ID to CPU cores / PCIe ID in a NUMA system

In a NUMA system, is there a direct way to get the GPU/CPU mapping? It looks to me like the only way to get a device ID is to run a CUDA program and enumerate from 0 to n-1. Does the device ID information reside somewhere else? Maybe in the proc file system, or in some NVIDIA tool?

I know we can use a latency/bandwidth test to generate the map, but it would be better if I could extract this information directly.

thanks
-gshi

To be fixed in a future release!

I’ll second this request. Would be very helpful in creating auto-affinity utilities.

I’ll go on record and say that this is fixed in 3.1.

Hi - Will there be some short document that explains how to do this?

Can you please explain how?

thanks

eyal

You’ll be able to get PCI bus information from the CUDA API as well as nvidia-smi.
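
For readers on Linux, here is a rough sketch of what can be done once the PCI bus ID of a device is known. It assumes you already have the bus ID string (in current CUDA toolkits it can come from the pciBusID / pciDeviceID fields of cudaDeviceProp or from cudaDeviceGetPCIBusId; nvidia-smi also prints it), and that the kernel exposes the numa_node and local_cpulist files under /sys/bus/pci/devices. The helper names normalize_pci_bus_id and gpu_locality are made up for this illustration, and Python is used only to keep the example short.

```python
import os
import re


def normalize_pci_bus_id(bus_id):
    """Convert e.g. '00000000:3B:00.0' or '3b:00.0' to the sysfs form '0000:3b:00.0'."""
    m = re.fullmatch(
        r"(?:([0-9a-fA-F]{4,8}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])",
        bus_id.strip(),
    )
    if m is None:
        raise ValueError("unrecognized PCI bus ID: %r" % bus_id)
    domain, bus, dev, fn = m.groups()
    # A missing domain is assumed to be domain 0.
    return "%04x:%02x:%02x.%s" % (int(domain or "0", 16), int(bus, 16), int(dev, 16), fn)


def gpu_locality(bus_id, sysfs_root="/sys/bus/pci/devices"):
    """Return (numa_node, local_cpulist) for a PCI device.

    numa_node is -1 when the kernel has no locality information;
    local_cpulist is a string such as '0-3,8-11'.
    """
    dev_dir = os.path.join(sysfs_root, normalize_pci_bus_id(bus_id))
    with open(os.path.join(dev_dir, "numa_node")) as f:
        node = int(f.read().strip())
    with open(os.path.join(dev_dir, "local_cpulist")) as f:
        cpus = f.read().strip()
    return node, cpus
```

Calling gpu_locality with the bus ID reported for device 0 would give the NUMA node and the CPUs that are local to that GPU, with no bandwidth test needed. Whether the reported node is meaningful depends on the platform firmware, so treat a value of -1 as "unknown".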

Does that mean I can use a CUDA API call to tell me the GPU-CPU affinity, then use something like pthread_setaffinity_np() to bind a thread to the right CPU? If so, that solves the last of my major multi-GPU problems. If not, can I have it added to “the list”?

Something like that, yes.
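
As a rough illustration of the binding step, here is a minimal sketch that assumes Linux and a CPU list string in the plain kernel format (for example "0-3,8-11"). It uses Python's os.sched_setaffinity rather than pthread_setaffinity_np; on Linux the underlying call applies to the calling thread when the pid argument is 0. The helper names parse_cpulist and bind_current_thread are hypothetical.

```python
import os


def parse_cpulist(text):
    """Parse a plain kernel CPU list such as '0-3,8-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def bind_current_thread(cpus):
    """Restrict the calling thread to `cpus`, keeping only CPUs it may already use."""
    allowed = os.sched_getaffinity(0) & set(cpus)
    if not allowed:
        raise ValueError("none of the requested CPUs are available to this thread")
    os.sched_setaffinity(0, allowed)
    return allowed
```

The idea would be to call bind_current_thread(parse_cpulist(...)) in each host thread before it touches its GPU, so that the thread and any pinned memory it allocates afterwards are likely to end up near the device. That ordering is a common recommendation, not something confirmed in this thread.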

How can one retrieve the affinities using the CUDA API? My nvidia-smi outputs null and complains about different kernel module and driver versions, though it runs. Now is not the time to reinstall a newer driver, though. In my multi-GPU applications I simply pinned the host threads to different processors, but it made no difference to performance…

tmurray said this new API would be available in CUDA 3.1, which has not been released yet, so I don’t think he can tell you any of the details.

I read “you’ll be able to” as meaning that it can already be done. Now that I read it again, the phrasing makes sense, and I should not have asked the question I did. My apologies.

I expect no official confirmation, but I’ll mention it anyway…

Tim, does the 3.1 nvidia-smi tool allow for setting GPU and memory clocks?

nvidia-settings does, but only when the X server is running, and then only for display cards.

I just want to downclock my compute cards to see CUDA program bottleneck dependencies on GPU shader and memory clocks.

Arbitrary clock control is definitely not something I’m planning on introducing into nvidia-smi in the near future.

OK, well, I’ll wait for the Far Future then.

Is Linux GPU clock adjustment restricted to display cards just because of support hassle? (probably…)

Does that mean that if I run the SDK bandwidthTest on a host thread bound to each core in turn, there will be a core for which the bandwidth curves are optimal? I’m not running a NUMA system. Also, and quite a bit off topic here: how can one explain the drop in device-to-host bandwidth that a Tesla C1060 shows in this picture (Tesla C1060 Device to Host)?

If you are not running a NUMA system, you should not see any significant difference between cores.

Edit: To be more specific, the issue being discussed here is latency/bandwidth differences between CPU sockets in a multi-socket system.

Just to expand on what seibert said a little, the problems with modern NUMA machines are twofold: PCIe controllers can have a natural CPU affinity (i.e. they are directly connected to one CPU, but one or more additional hops away over a QPI or HT link from the others), and memory is distributed among the different CPU memory controllers. So doing something like allocating pinned memory for CUDA and copying it to a GPU can potentially show a lot of variation in effective throughput depending on GPU-CPU-RAM affinity. This is only seen on multi-CPU machines (or possibly on the new AMD Magny-Cours processor, which is effectively NUMA within a single socket).
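
To check whether a given Linux machine has more than one NUMA node at all, one option is to list the node directories the kernel exposes. This is only a sketch: it assumes the usual /sys/devices/system/node layout, and the function name numa_nodes is made up for the example.

```python
import glob
import os
import re


def numa_nodes(node_root="/sys/devices/system/node"):
    """Return {node_id: cpulist_string} for every NUMA node the kernel reports."""
    nodes = {}
    for path in glob.glob(os.path.join(node_root, "node[0-9]*")):
        m = re.fullmatch(r"node(\d+)", os.path.basename(path))
        if m is None:
            continue  # skip entries that merely start with 'node' plus a digit
        with open(os.path.join(path, "cpulist")) as f:
            nodes[int(m.group(1))] = f.read().strip()
    return dict(sorted(nodes.items()))
```

If the result contains a single node (or none), the socket-to-socket effects described above should not apply, and core placement is unlikely to matter much for GPU transfers.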

To convince yourself that it’s not just a detail, here are some small experiments. I’ve taken Tim’s great “concurrent bandwidth” test, which basically stresses all the links of your machine, so if you have a bad mapping of your CUDA threads/contexts, you will get poor performance because of high contention (latency is not important here).

Here are the results of Tim’s original benchmark on a machine with 3 GPUs and 2 NUMA nodes (there are 8 Nehalem cores).

gonnet@hannibal:~/These/Tests/ConcurrentBandwithTest$ ./concBandwidthTest 0 1 2
Device 0 took 1303.871582 ms
Device 1 took 2427.816406 ms
Device 2 took 1739.778564 ms
Average HtoD bandwidth in MB/s: 11223.201904
Device 0 took 2677.287842 ms
Device 1 took 6912.451660 ms
Device 2 took 6304.711426 ms
Average DtoH bandwidth in MB/s: 4331.458069

It turns out that the performance is not that stable, so sometimes I get bad HtoD bandwidth (~8 GB/s), because the threads may not be bound, as we mentioned earlier.

So I modified Tim’s test to do some HtoD bandwidth benchmarking and detect which NUMA node is closest to each device. Note that this is also what various libraries do (notably Guochun Shi’s CUDA wrapper). Once the closest NUMA node has been found for each device, I bind the thread that holds the context to the CPUs associated with that NUMA node. I used the hwloc library (Portable Hardware Locality) to do that, because there is unfortunately no easy way to bind a thread to a NUMA node, or more simply to distinguish the two hardware contexts of the same hyperthreaded core (those Nehalems are hyperthreaded). But basically it’s just a “portable” pthread_setaffinity_np…
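
The selection step can be sketched roughly as follows. This is not the code of the modified test. It assumes a hypothetical callable measure(device, node) that binds the calling thread to the given NUMA node, runs an HtoD copy to the given device, and returns the bandwidth in MB/s. The name closest_nodes and the repeats parameter are also made up for the illustration.

```python
def closest_nodes(measure, devices, nodes, repeats=3):
    """Pick, for each device, the NUMA node with the best measured bandwidth.

    `measure(device, node)` is a hypothetical callable returning MB/s.
    The best of `repeats` runs is kept to reduce the effect of noisy runs.
    """
    best = {}
    for dev in devices:
        scores = {
            node: max(measure(dev, node) for _ in range(repeats))
            for node in nodes
        }
        # On a tie, the first node listed in `nodes` wins.
        best[dev] = max(scores, key=scores.get)
    return best
```

Each host thread would then be bound to the CPUs of best[device] before running the real workload.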

gonnet@hannibal:~/These/Tests/ConcurrentBandwithTest$ ./concBandwidthTest_numa 0 1 2
Device 0 took 1121.960327 ms
Device 1 took 2155.578125 ms
Device 2 took 1382.516846 ms
Average HtoD bandwidth in MB/s: 13302.581055
Device 0 took 3831.385010 ms
Device 1 took 4927.671387 ms
Device 2 took 3831.820068 ms
Average DtoH bandwidth in MB/s: 4639.426758

As you can see, this trivial mechanism gives a significant improvement, and the results are much more stable too. Binding the CUDA threads in this way also gave significant improvements for less trivial benchmarks. When other I/O is involved (MPI or disks), such considerations may become even more interesting. Conclusion: I can’t wait to get CUDA 3.1! ;)
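
To put numbers on the improvement, the relative gains can be computed from the two runs quoted above. The figures are copied from the benchmark output; the function name gain_percent is just for this example.

```python
# Average bandwidths in MB/s, copied from the two runs shown above.
before = {"HtoD": 11223.201904, "DtoH": 4331.458069}  # original test
after = {"HtoD": 13302.581055, "DtoH": 4639.426758}   # NUMA-aware binding


def gain_percent(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


for direction in before:
    gain = gain_percent(before[direction], after[direction])
    print("%s: %+.1f%%" % (direction, gain))
```

That works out to roughly +18.5% for HtoD and +7.1% for DtoH on this single pair of runs. Given the run-to-run variation mentioned earlier, these percentages are indicative only.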

Just my 2 cents,
Cédric

PS: I’ve put the slightly modified code on http://runtime.bordeaux.inria.fr/augonnet/ConcurrentBandwithTest , but you need to install hwloc to get it working (it’s available in various distros or simply at http://www.open-mpi.org/projects/hwloc/)