In a NUMA system, is there a direct way to get this GPU/CPU map information? It looks to me like the only way to get the device ID is to run some CUDA program and enumerate from 0 to n-1. Does the device ID information reside somewhere else? Maybe the proc file system? Or some NVIDIA tool?
I know we can use a latency/bandwidth test to generate the map, but it would be better if I could extract this information directly.
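For what it's worth, here is one possible direct route on Linux (a minimal sketch, assuming your CUDA version reports PCI IDs in cudaDeviceProp, the GPUs sit in PCI domain 0, and the kernel exposes local_cpulist in sysfs):

/* For each CUDA device, look up the CPUs local to its PCI slot in sysfs. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    for (int dev = 0; dev < ndev; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        /* sysfs path built from the PCI bus/device IDs CUDA reports (domain 0 assumed) */
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/0000:%02x:%02x.0/local_cpulist",
                 prop.pciBusID, prop.pciDeviceID);

        char cpulist[128] = "unknown\n";
        FILE *f = fopen(path, "r");
        if (f) {
            if (!fgets(cpulist, sizeof(cpulist), f))
                snprintf(cpulist, sizeof(cpulist), "unknown\n");
            fclose(f);
        }
        printf("device %d (%s): local CPUs %s", dev, prop.name, cpulist);
    }
    return 0;
}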
Does that mean I can use a CUDA API call to tell me the GPU-CPU affinity, and then use something like pthread_setaffinity_np() to bind a thread to the right CPU? If so, that solves the last of my major multi-GPU problems; if not, can I have it added to “the list”?
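For the binding half of that, here is a minimal sketch (assuming Linux; the CPU number is a placeholder that would come from whatever affinity lookup you end up using):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <cuda_runtime.h>

/* Bind the calling host thread to a single CPU before it creates its CUDA context. */
static void bind_this_thread_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}

/* Per-GPU worker thread: bind first, then touch the GPU, so the context,
 * pinned allocations and copies all happen from the bound CPU. */
void *worker(void *arg)
{
    int gpu = *(int *)arg;
    bind_this_thread_to_cpu(0 /* placeholder: the CPU close to this GPU */);
    cudaSetDevice(gpu);
    /* ... cudaHostAlloc / cudaMemcpy work goes here ... */
    return NULL;
}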
How can one retrieve the affinities using the CUDA API? My nvidia-smi outputs null, complaining about mismatched kernel module and driver versions, though it does run. But now is not the time to reinstall a newer driver. In my multi-GPU applications I simply threw the host threads onto different processors, but to no avail performance-wise…
I thought that “you’ll be able to” was referring to something that can already be done; now that I read it again, the phrasing makes sense, and I shouldn’t have asked the question I eventually did :"> . My apologies.
Meaning that if I run the SDK bandwidthTest on a host thread on each of the different cores, there will be a core for which the bandwidth curves are optimal? I’m not running a NUMA system. Furthermore, and quite a bit off topic here, how can one explain the drop in device-to-host bandwidth a Tesla C1060 shows in this picture: Tesla C1060 Device to Host
Just to expand on what seibert said a little, the problems with modern NUMA machines are twofold: PCIe controllers can have a natural CPU affinity (i.e. they are directly connected to one CPU, but one or more additional hops away over a QPI or HT link from the others), and memory is distributed among the different CPUs’ memory controllers. So doing something like allocating pinned memory for CUDA and copying to a GPU can show a lot of variation in effective throughput depending on GPU-CPU-RAM affinity. This is only seen on multi-CPU machines (or possibly on the new AMD Magny-Cours processor, which is effectively NUMA within a single socket).
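To see the effect described above, one can time the same pinned host-to-device copy with the host thread bound to CPUs on different sockets. A rough sketch (assuming Linux sched_setaffinity, and that CPU 0 and CPU 4 sit on different sockets, to be adjusted for your topology):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <cuda_runtime.h>

/* Bind to one CPU, allocate pinned memory there, and time a single HtoD copy. */
static float timed_copy_from_cpu(int cpu, size_t bytes)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);   /* bind before allocating */

    void *host = NULL, *dev = NULL;
    cudaHostAlloc(&host, bytes, cudaHostAllocDefault);  /* pinned pages on the local node */
    cudaMalloc(&dev, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    cudaFreeHost(host);
    return ms;
}

int main(void)
{
    const size_t bytes = 256 << 20;   /* 256 MB */
    printf("HtoD with thread on CPU 0: %.2f ms\n", timed_copy_from_cpu(0, bytes));
    printf("HtoD with thread on CPU 4: %.2f ms\n", timed_copy_from_cpu(4, bytes));
    return 0;
}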
To convince yourself that it’s not just a detail, here is a small experiment: I’ve taken Tim’s great “concurrent bandwidth” test, which basically stresses all the links of your machine, so if you have a bad mapping of your CUDA threads/contexts you will get poor performance because of high contention (latency is not important here).
Here is the performance of Tim’s original benchmark on a machine with 3 GPUs and 2 NUMA nodes (8 Nehalem cores in total).
gonnet@hannibal:~/These/Tests/ConcurrentBandwithTest$ ./concBandwidthTest 0 1 2
Device 0 took 1303.871582 ms
Device 1 took 2427.816406 ms
Device 2 took 1739.778564 ms
Average HtoD bandwidth in MB/s: 11223.201904
Device 0 took 2677.287842 ms
Device 1 took 6912.451660 ms
Device 2 took 6304.711426 ms
Average DtoH bandwidth in MB/s: 4331.458069
And it turns out that the performance is not that stable, so sometimes I get bad HtoD bandwidth (~8 GB/s), because the threads may not be bound as we mentioned earlier.
So I modified Tim’s test to do some HtoD bandwidth benchmarking and detect which NUMA node is the closest one. Note that this is also what is done in various libraries (especially in Guochun Shi’s CUDA wrapper). Once the closest NUMA node is found for each device, I bind the thread that holds the context to the CPUs associated with that NUMA node. I used the hwloc library (Portable Hardware Locality (hwloc)) to do that, because there is unfortunately no easy way to bind a thread to a NUMA node, or more simply to distinguish two contexts of the same hyperthreaded core (as those Nehalems are hyperthreaded). But basically it’s just a “portable” pthread_setaffinity (a minimal sketch of that binding is at the end of this post)…
gonnet@hannibal:~/These/Tests/ConcurrentBandwithTest$ ./concBandwidthTest_numa 0 1 2
Device 0 took 1121.960327 ms
Device 1 took 2155.578125 ms
Device 2 took 1382.516846 ms
Average HtoD bandwidth in MB/s: 13302.581055
Device 0 took 3831.385010 ms
Device 1 took 4927.671387 ms
Device 2 took 3831.820068 ms
Average DtoH bandwidth in MB/s: 4639.426758
As you can see, this trivial mechanism gives a significant improvement … and the results are much more stable too. Binding the CUDA threads in such a way also gave significant improvements for less trivial benchmarks. When we have other I/O (MPI or disks), such considerations may become even more important. Conclusion: I can’t wait to get CUDA 3.1! ;)
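In case it helps someone, the hwloc binding itself boils down to something like this (a minimal sketch, assuming hwloc 1.x and linking with -lhwloc; the node index is whatever the bandwidth probing selected):

/* Bind the calling thread (the one that will hold the CUDA context for a
 * given GPU) to all the CPUs of a chosen NUMA node. */
#include <hwloc.h>

static int bind_thread_to_numa_node(int node_index)
{
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* Grab the NUMA node object and bind this thread to its whole cpuset,
     * so we stay on that socket but can still use any of its cores. */
    hwloc_obj_t node = hwloc_get_obj_by_type(topology, HWLOC_OBJ_NODE, node_index);
    int err = -1;
    if (node)
        err = hwloc_set_cpubind(topology, node->cpuset, HWLOC_CPUBIND_THREAD);

    hwloc_topology_destroy(topology);
    return err;
}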