What is the meaning of the CNS (chipset not supported) error in nvidia-smi?

If I run “nvidia-smi topo -p2p r”, I get this result.

root@node16:~# nvidia-smi topo -p2p r
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0     X    CNS   CNS   CNS   CNS   CNS   CNS   CNS
GPU1    CNS    X    CNS   CNS   CNS   CNS   CNS   CNS
GPU2    CNS   CNS    X    CNS   CNS   CNS   CNS   CNS
GPU3    CNS   CNS   CNS    X    CNS   CNS   CNS   CNS
GPU4    CNS   CNS   CNS   CNS    X    CNS   CNS   CNS
GPU5    CNS   CNS   CNS   CNS   CNS    X    CNS   CNS
GPU6    CNS   CNS   CNS   CNS   CNS   CNS    X    CNS
GPU7    CNS   CNS   CNS   CNS   CNS   CNS   CNS    X

Legend:

X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown

I think this is not normal, so I want to know the meaning and cause of the “CNS” error. I googled for several hours, but my effort turned out to be in vain. Thanks in advance.

It means that the chipset (i.e. motherboard core logic) you are using is not supported by NVIDIA for P2P traffic between GPUs.

Thank you for your prompt reply. Now I wonder what actually caused the problem. The PCIe switch? Because the PCIe switch doesn’t allow P2P communication? Or because the GPU doesn’t know how to communicate with the PCIe switch for P2P access? Do I have to replace the motherboard? Or would simply installing a new graphics driver fix the problem?

A mainboard with a PEX 8747 PCIe switch had no problem, but the one with a PEX 8796, which seems to have the better specification, has the above problem. Here is the PCIe topology configuration. I am using 8 × RTX 2080 Tis.

twjang@node16:~$ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity
GPU0	 X 	PIX	PIX	PIX	SYS	SYS	SYS	SYS	0-7,16-23
GPU1	PIX	 X 	PIX	PIX	SYS	SYS	SYS	SYS	0-7,16-23
GPU2	PIX	PIX	 X 	PIX	SYS	SYS	SYS	SYS	0-7,16-23
GPU3	PIX	PIX	PIX	 X 	SYS	SYS	SYS	SYS	0-7,16-23
GPU4	SYS	SYS	SYS	SYS	 X 	PIX	PIX	PIX	8-15,24-31
GPU5	SYS	SYS	SYS	SYS	PIX	 X 	PIX	PIX	8-15,24-31
GPU6	SYS	SYS	SYS	SYS	PIX	PIX	 X 	PIX	8-15,24-31
GPU7	SYS	SYS	SYS	SYS	PIX	PIX	PIX	 X 	8-15,24-31

Thanks in advance.

I see a similar thing on my system (Skylake, Intel C621 chipset, PLX PEX8796 switches):

  • mystery "CNS" in `nvidia-smi topo -p2p r` everywhere except "X"
  • two groups of "PIX" in `nvidia-smi topo -m` result
  • but I've got "NODE" instead of "SYS"

And I also wonder if “CNS” would be there for any system with PEX8796, with C621 or with any Skylake. Is it a “driver-level ban for specific hardware, by ID”? Or is it a result of a sort of a “test run”?

Some people say that “nvidia-smi is a thin wrapper around NVML”. But the only thing in the NVML docs (vR410 | October 2018) that sounds similar is nvmlDeviceGetBridgeChipInfo - “if bridge chip not supported on the device” - but I doubt it’s actually related: the “Only applicable to multi-GPU products” note sounds like it’s for devices like the K80 only. Stuff related to nvmlGpuTopologyLevel_t and maybe nvmlDeviceGetBAR1MemoryInfo is somewhat closer, but not by much.
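For what it’s worth, later NVML headers do expose a per-pair query, nvmlDeviceGetP2PStatus(), returning an nvmlGpuP2PStatus_t; its values (OK / chipset not supported / GPU not supported / topology not supported / …) look like they map onto the legend that `nvidia-smi topo -p2p` prints, though that mapping is my assumption, not something stated in the docs. A minimal sketch under that assumption (link with `-lnvidia-ml`; output is entirely system-dependent):

```cuda
/* Sketch: query the per-pair P2P read status for every GPU pair via NVML.
 * Assumption: this is (roughly) what "nvidia-smi topo -p2p r" reports. */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    unsigned int n, i, j;
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }
    nvmlDeviceGetCount(&n);
    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            nvmlDevice_t a, b;
            nvmlGpuP2PStatus_t st;
            if (i == j) continue;
            nvmlDeviceGetHandleByIndex(i, &a);
            nvmlDeviceGetHandleByIndex(j, &b);
            /* READ capability; WRITE and others are separate indices */
            if (nvmlDeviceGetP2PStatus(a, b, NVML_P2P_CAPS_INDEX_READ, &st)
                    == NVML_SUCCESS)
                printf("GPU%u -> GPU%u : %s\n", i, j,
                       st == NVML_P2P_STATUS_OK
                           ? "OK" : "not OK (see nvmlGpuP2PStatus_t)");
        }
    }
    nvmlShutdown();
    return 0;
}
```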

I am using 10 × RTX 2080 Tis in my C621 system with PLX PEX8796 switches and hit the same issue.
Does it mean that Intel C621 with PEX8796 cannot support the P2P function?

Thanks in advance.

I contacted the server manufacturer’s support (Tyan) a couple of days ago. Now waiting for the answer.
Maybe they will be able to shed some light on this. It’s a pity that @Robert_Crovella is not willing to…

Generally speaking, employees of one vendor publicly commenting on products of another vendor (and in particular, a competitor) is highly problematic, and may have various legal ramifications. Common sense alone therefore favors a no-comment approach, beyond the fact that the subject under discussion here is very specific knowledge likely limited to relatively few people.

The designated sales channel for Tesla GPUs is system integrators, in particular those which are NVIDIA partners (https://www.nvidia.com/en-us/data-center/where-to-buy-tesla/). System integrators that sell systems incorporating Tesla GPUs should be able to answer customer questions regarding specific properties of the systems they offer.

I’ve commented in another thread that PCIE P2P is not supported on Titan RTX and it is also not supported on RTX 2080 series.

https://devtalk.nvidia.com/default/topic/1046951/cuda-programming-and-performance/does-titan-rtx-support-p2p-access-w-o-nvlink-/

@Robert_Crovella, OK, thank you for your answer!
Am I correct that this limitation of the RTX 2080 (compared to the GTX 1080 Ti) is not documented anywhere, officially or unofficially?

I’ve only managed to find this:

Even though the only theoretical requirement for GPUDirect RDMA to work between a third-party device
and an NVIDIA GPU is that they share the same root complex, there exist bugs (mostly in chipsets) 
causing it to perform badly, or not work at all in certain setups.

I understand that the best determinant of features like this is what the tool reports. However, sometimes people might want to know before they buy. (For me, personally, this specific feature is not critical.)
Moreover, what the tool reports looks somewhat misleading too - it turns out that “chipset not supported” might not be related to the chipset at all…

@njuffa,
I do agree that it’s usually better to contact your server manufacturer. Note that Tyan is listed (#1) in the Tesla qualified servers catalog. However, their answer was “OK, we can reproduce this. We’ll ask NVIDIA what’s happening here.” (I was pretty surprised they bothered to reproduce it.) Thus, it looks like even NVIDIA partners do not have the necessary docs for this.

I don’t know that it is documented anywhere. I don’t know of any sort of formal documentation that is product-specific on P2P capabilities (supported/not supported). I don’t recommend making volume buying decisions for any product without verification of the desired use case(s). For single unit or evaluation quantities, I think many vendors have return policies.

Your quote applies to GPUDirect RDMA, which is not the same as GPUDirect P2P. This thread has P2P in view, I think.

If you’d like to see a change to any aspect of the CUDA ecosystem maintained by NVIDIA e.g. documentation, you can file a bug. The instructions are linked at the top of the CUDA programming sub-forum.

Note that in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU type or GPU family does not necessarily indicate that it will work on another GPU type or family, even in the same system/setup. The final determinant of GPU P2P support is the tools provided that query the runtime via cudaDeviceCanAccessPeer. P2P support can vary by system and other factors as well. No statements made here are a guarantee of P2P support for any particular GPU in any particular setup.
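The runtime-level check mentioned above can be run over all device pairs with a few lines of CUDA; this is only a sketch, and the printed result depends entirely on the GPUs and platform:

```cuda
// Sketch: ask the CUDA runtime, for every ordered GPU pair, whether P2P
// access is reported as possible via cudaDeviceCanAccessPeer. This is the
// runtime-level answer, independent of the status codes nvidia-smi prints.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU%d -> GPU%d : %s\n", i, j,
                   canAccess ? "P2P capable" : "no P2P");
        }
    }
    return 0;
}
```

(To actually use P2P after a positive answer, one still has to call cudaDeviceEnablePeerAccess on each side, as the simpleP2P sample does.)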

@Lezz: The reaction you got from Tyan is exactly what I would expect from an NVIDIA partner. I have no insights into what NVIDIA does internally to research RDMA capabilities on the myriad Intel and AMD platforms that exist, with new ones being created monthly.

But knowing a bit about general industry practices, I doubt that NVIDIA systematically tests all new platforms that come on the market and keeps an exhaustive internal list that they could simply publish if so inclined. Rather, NVIDIA provides software that lets partners and end users alike test the proper operation of the GPUDirect RDMA capability. If partners then run into issues with a platform for which they want to guarantee their customers full GPUDirect RDMA capability, they can take it up with NVIDIA if they think there is a problem on NVIDIA’s side (rather than on Intel’s or AMD’s side).

When it comes to Tesla GPUs, NVIDIA’s customers are the system integrators. End users in turn are customers of those system integrators. End-user support is furnished by the system integrator who sold the system. The integrators add a nice margin to the price of components including GPUs when they sell their systems to cover their support costs (this is trivial to see in the case of DRAM, for example).

I like car analogies. When your car doesn’t perform to spec because of an issue with one of its components, you take that up with the auto company that sold the car. You don’t contact the suppliers of the components used in that car. The auto company in turn will take up any issues they cannot resolve on their own with their component suppliers. What information a component supplier does or doesn’t furnish to the car manufacturer isn’t the end user’s business.

@Robert_Crovella, indeed, “GPUDirect p2p” and “GPUDirect RDMA” are not the same. Thanks for clarifying my potentially misleading statement. My idea was that they do have something in common and thus may suffer from the same bug in a chipset. Anyway, that proved to be an entirely wrong direction.

Concerning the bug report, I’ve just filed 2568627 (about the “CNS” in nvidia-smi) and also 2568641 (about misleading error message in “simpleP2P” example).

BTW, here’s one more post about “p2p on RTX2080 not working without NVLINK bridge”.

@njuffa, AFAIU, after all, the issue is not related to some unexpected incompatibility with 3rd party hardware. This functionality was simply intentionally removed.

I’ll say that there seems to be some confusion here about what features we are talking about; I was specifically addressing RDMA capability, which is impacted by the host platform, not just the GPU.

NVIDIA is free to design their products in any way they deem suitable, and since their idea of suitable can and does change over time, one should not make assumptions of the sort that “X works on Y, therefore my expectation is that X also works on Z”.

I’d be the first to agree that NVIDIA’s technical marketing collateral leaves too much guesswork for end users trying to find out the feature set of a specific SKU. Calling two pages with a mere smattering of information a “data sheet” borders on the comical, as far as I am concerned.

Robert Crovella has pointed out that bug reports can be filed by people who are bothered by this. I find it a bit strange on the part of NVIDIA that they make people jump through such hoops; I have never in my engineering life filed a bug report to get technical marketing materials improved. But I am pretty sure Robert Crovella is not the one setting policy. More information in the hands of consumers should not hurt sales, and may actually help sales (whenever there is a choice, I usually go for the products with the smoothest purchasing process). The next recession may bring a resurgence of more customer-friendly, proactive attitudes. Who knows.

As far as Tesla GPUs are concerned, support is through system integrators, and questions on feature sets are best directed to them, as these GPUs are sold as a component of a system, with the system vendor guaranteeing certain properties of the systems they sell. If you acquired Tesla hardware through some other channel (OEM surplus, second hand), I am afraid you will be pretty much on your own support-wise. But your hardware costs will be lower :-)