Using remote GPU memory access causes power error in DGX-1

sdutt004 · April 16, 2021, 8:17pm

Hello

I am running my program on a DGX-1 with 8 P100 GPUs connected through NVLink 1.0. I have allocated memory on GPU A and am trying to access it from GPU B over NVLink. I call cudaSetDevice(B) to set the current device and then cudaDeviceEnablePeerAccess(A, 0) to enable peer access to GPU A.

However, after I run my program, nvidia-smi shows an error in the power column. It is a very simple program that just accesses the remote GPU memory and performs some simple measurements. The error affects the access time and distorts my measurements. It appears even if I take no measurements and only access the remote GPU memory.

The installed driver version is 410.79 and the CUDA runtime version is 10.0. Could someone give me some indication of what might be causing the error and what can be done to avoid it? Please let me know if you require any further information. Thank you.

njuffa · April 16, 2021, 9:35pm

What exactly is a “power error”? I have used CUDA for more than a dozen years, and all kinds of GPUs, but that term means nothing to me. An internet search for power error plus DGX-1 turned up nothing.

To my knowledge, all NVIDIA DGX systems include comprehensive lifetime support, and I wouldn’t expect any less given that the DGX-1 reportedly sells for US$129,000. Why are you asking here instead of addressing the issue through the vendor’s designated support channel?

Here are all the items reported by nvidia-smi on my system that contain the word “Power”. There is no “Power Error” among them.

    SW Power Cap                      : Not Active
    HW Power Brake Slowdown           : Not Active
Power Readings
    Power Management                  : Supported
    Power Draw                        : 5.17 W
    Power Limit                       : 125.00 W
    Default Power Limit               : 125.00 W
    Enforced Power Limit              : 125.00 W
    Min Power Limit                   : 100.00 W
    Max Power Limit                   : 125.00 W

You might want to cut & paste the exact command line you used to invoke nvidia-smi, and cut & paste the output produced by that invocation.
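
To make a pasted report easier to compare across runs, the power-related fields can be pulled out of saved `nvidia-smi -q` text (for example, output captured with `nvidia-smi -q -d POWER`). The following is a minimal Python sketch, not an NVIDIA tool: the function name `parse_power_fields` is hypothetical, and it assumes the `Key : Value` line layout seen in the excerpt above. Values that are not plain wattages are kept as raw strings rather than interpreted.

```python
import re


def parse_power_fields(text):
    """Return {field name: value} for 'Key : Value' lines whose key mentions power.

    Wattages such as '5.17 W' become floats; anything else stays a string.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # section headers such as "Power Readings" have no colon
        key, value = key.strip(), value.strip()
        if "power" not in key.lower():
            continue
        match = re.fullmatch(r"([0-9]+(?:\.[0-9]+)?) W", value)
        fields[key] = float(match.group(1)) if match else value
    return fields


# Sample text copied from the excerpt above (assumed layout).
sample = """\
    SW Power Cap                      : Not Active
    HW Power Brake Slowdown           : Not Active
Power Readings
    Power Management                  : Supported
    Power Draw                        : 5.17 W
    Power Limit                       : 125.00 W
"""

for name, value in parse_power_fields(sample).items():
    print(f"{name}: {value!r}")
```

Any field whose value is not a wattage shows up as a string in the result, which makes non-numeric readings easy to spot when comparing GPUs.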

sdutt004 · April 16, 2021, 10:40pm

I am sorry, but I need to mention that I have also been using GPUs for more than a decade, and not only NVIDIA GPUs but GPUs and accelerators from various vendors. Yet I have never felt that I won’t learn something new or run into new obstacles in my job. If you need some clarification, you can simply ask for it rather than being condescending in your reply. If you can help me in some way, I would appreciate that; otherwise you can simply refrain from replying. You said that your Google search did not turn up any results. Perhaps it did not occur to you that this is the very reason I turned to the NVIDIA forums for an answer.

That being said, I have attached an image of the error I mentioned. You can see an ERR! message for GPUs 0, 1, 2, and 5 under the power column. nvidia-smi -q also showed an unknown power draw error, and a clocks error as well.

I was hoping that someone from NVIDIA could offer a helpful response. If you can help me sort out the error or point me towards some helpful resources, I would really appreciate it. Otherwise, you can simply abstain from giving me a disdainful response.

njuffa · April 16, 2021, 11:17pm

Your post is off-topic in this sub-forum. You are more than welcome to wait for someone from NVIDIA to come along and help you with this issue. Or use the dedicated support channel associated with a DGX-1 system.

Robert_Crovella · April 17, 2021, 12:24am

I’m sorry you’re having an issue with your DGX.

Using the dedicated support channel would be my advice as well. If you purchased a DGX-1, you very likely purchased a support agreement along with it, so take advantage of that and use the support portal. A ticket will be assigned to your issue, and NVIDIA support engineering will get it reviewed and resolved for you.

enterprise support landing page

enterprise support portal
