Using remote GPU memory access causes power error in DGX-1


I am running my program on a DGX-1 box with 8 P100 GPUs connected through NVLink 1.0. I allocate memory on GPU A and try to access it from GPU B via NVLink: I call cudaSetDevice(B) to set the current device and then cudaDeviceEnablePeerAccess(A, 0) to enable peer access. However, after I run the program, nvidia-smi shows an error in the power column. The program is very simple: it just accesses the remote GPU memory and performs some simple measurements. The error affects the access time and skews my measurements, and it appears even if I take no measurements and only access the remote GPU memory. The installed driver version is 410.79 and the CUDA runtime version is 10.0. Could someone indicate what might be causing the error and what can be done to avoid it? Please let me know if you require any further information. Thank you.
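For anyone trying to reproduce this, the access pattern described above can be sketched roughly as follows. This is not the original poster's program, only a minimal illustration under stated assumptions: the helper name peerReadCheck, the kernel copyFromPeer, the device indices 0 and 1, and the buffer size are all made up for the example, and resources are not released on the early-return error paths.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// On failure, print the failing call and return 1 from the enclosing function.
#define CUDA_TRY(call)                                              \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            std::fprintf(stderr, "%s failed: %s\n", #call,          \
                         cudaGetErrorString(err_));                 \
            return 1;                                               \
        }                                                           \
    } while (0)

// Runs on the reader GPU; each load from `remote` goes to the owner GPU's memory.
__global__ void copyFromPeer(const int* remote, int* local, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) local[i] = remote[i];
}

// Hypothetical helper (not from the original post).
// Returns 0 on success, -1 on bad arguments, -2 if peer access is not
// supported between the two devices, 1 on a CUDA error, 2 on a data mismatch.
int peerReadCheck(int owner, int reader, size_t n) {
    if (owner == reader || n == 0) return -1;

    int canAccess = 0;
    CUDA_TRY(cudaDeviceCanAccessPeer(&canAccess, reader, owner));
    if (!canAccess) return -2;

    // Allocate and fill the buffer on the owner GPU ("GPU A").
    int* remote = nullptr;
    CUDA_TRY(cudaSetDevice(owner));
    CUDA_TRY(cudaMalloc(&remote, n * sizeof(int)));
    CUDA_TRY(cudaMemset(remote, 1, n * sizeof(int)));  // every int becomes 0x01010101
    CUDA_TRY(cudaDeviceSynchronize());                 // finish the fill before the peer reads

    // Switch to the reader GPU ("GPU B") and enable access to the owner's memory.
    int* local = nullptr;
    CUDA_TRY(cudaSetDevice(reader));
    CUDA_TRY(cudaDeviceEnablePeerAccess(owner, 0));
    CUDA_TRY(cudaMalloc(&local, n * sizeof(int)));

    unsigned threads = 256;
    unsigned blocks = (unsigned)((n + threads - 1) / threads);
    copyFromPeer<<<blocks, threads>>>(remote, local, n);
    CUDA_TRY(cudaGetLastError());
    CUDA_TRY(cudaDeviceSynchronize());

    // Verify on the host that the reader saw the owner's data.
    std::vector<int> host(n);
    CUDA_TRY(cudaMemcpy(host.data(), local, n * sizeof(int),
                        cudaMemcpyDeviceToHost));
    int status = 0;
    for (int v : host) {
        if (v != 0x01010101) { status = 2; break; }
    }

    CUDA_TRY(cudaFree(local));
    CUDA_TRY(cudaDeviceDisablePeerAccess(owner));
    CUDA_TRY(cudaSetDevice(owner));
    CUDA_TRY(cudaFree(remote));
    return status;
}

int main() {
    // Device indices 0 and 1 and the 1 Mi-element buffer are arbitrary choices.
    int rc = peerReadCheck(0, 1, 1 << 20);
    std::printf("peerReadCheck returned %d\n", rc);
    return rc == 0 ? 0 : 1;
}
```

Checking every return code and calling cudaDeviceCanAccessPeer first at least separates a failure in the program itself from whatever nvidia-smi reports afterwards; it does not by itself explain the power readout.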

What exactly is a “power error”? I have used CUDA for more than a dozen years, and all kinds of GPUs, but that term means nothing to me. An internet search for power error plus DGX-1 turned up nothing.

To my knowledge, all NVIDIA DGX systems include comprehensive lifetime support, and I wouldn’t expect any less given that the DGX-1 reportedly sells for US$129,000. Why are you asking here instead of addressing the issue through the vendor’s designated support channel?

Here are all the items reported by nvidia-smi on my system that contain the word “Power”. None of them is a “Power Error”.

   SW Power Cap                      : Not Active
        HW Power Brake Slowdown       : Not Active
Power Readings
    Power Management                  : Supported
    Power Draw                        : 5.17 W
    Power Limit                       : 125.00 W
    Default Power Limit               : 125.00 W
    Enforced Power Limit              : 125.00 W
    Min Power Limit                   : 100.00 W
    Max Power Limit                   : 125.00 W

You might want to cut & paste the exact command line you used to invoke nvidia-smi, and cut & paste the output produced by that invocation.

I am really sorry, but I need to mention that I have also been using GPUs for more than a decade, and not only NVIDIA GPUs but GPUs and accelerators from various vendors. Even so, I have never felt that I won’t learn something new or run into new impediments in my job. If you need some clarification, you can simply ask for it rather than being condescending in your reply. If you can help me in some way, I would appreciate that; otherwise you can simply refrain from replying. You said that your Google search didn’t turn up any results. I am not sure whether it occurred to you that this may be the very reason I turned to the NVIDIA forums for an answer. That being said, I have attached an image of the error I mentioned. You can see an ERR! message for GPUs 0, 1, 2 and 5 under the power column. nvidia-smi -q also showed an unknown error for the power draw, and an error for the clocks as well.

[Screenshot 2]

I was hoping that someone from NVIDIA could point me towards a helpful response. If you can help me sort out the error or point me towards some helpful resources, I would really appreciate it. Otherwise, you can simply abstain from providing a disdainful response.

Your post is off-topic in this sub-forum. You are more than welcome to wait for someone from NVIDIA to come along and help you with this issue. Or use the dedicated support channel associated with a DGX-1 system.

I’m sorry you’re having an issue with your DGX.

That would be my advice. If you purchased a DGX-1 you very likely purchased a support agreement along with it. My advice would be to take advantage of that, and use the support portal. You will get a ticket assigned to your issue, and NVIDIA support engineering will get it reviewed and resolved for you.

enterprise support landing page

enterprise support portal