I have a Dell R730 server with an NVIDIA Tesla M60 GPU installed via the Dell GPU Enablement Kit, and I am trying to generate full load on the GPU to check system stability. I downloaded the V-Ray Benchmark tool and ran it under Windows Server 2022, but the M60 only draws about 190 W total (95 W per GPU core), not the full 150 W per core (300 W total).
Please see the YouTube video below for the test.
I tried FurMark, but it reported that my card's driver only supports OpenGL 1.1, below the minimum requirement of 2.1, so the tool could not run. I also tried 3DMark, but it stressed only one of the two M60 GPU cores, and even the stressed core drew only about 6x W, not the maximum 150 W.
Here is the HW configuration:
CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
RAM: DDR4 16GB x4 (A1/A2/B1/B2)
PSU: 1100W x2, redundancy policy and hot spare are disabled from iDRAC
Power cap disabled from iDRAC
No SATA/SAS drives installed
No RAID card installed
BCM OCP LAN card installed
Only one M.2 SSD with OS pre-installed
Here is the FW/SW configuration:
NVIDIA DataCenter driver: v566.03
NVIDIA M60 firmware: core1-84.04.A6.00.01, core2-84.04.A6.00.02
Dell R730 BIOS: latest version 2.19.0
Dell R730 iDRAC: latest version v2.86.86.86
OS: Windows Server 2022 Datacenter
Questions:
1. Is the inability to fully load the M60 due to a HW or SW configuration limit, or a tool limitation? If it is a tool limitation, please advise another tool.
2. Why are the firmware versions on the two GPU cores different?
3. Is the GPU firmware the latest version? If not, where can I get it?
Finding a test setup that pegs a card at its maximum possible consumption is difficult, and it will be even harder given that there are two GPUs on the M60. If it were me, I would probably try running the nbody sample code (from the CUDA samples) in benchmark mode. Fiddle with that sample's parameters until it delivers the highest power consumption on a single GPU, then run two instances of it, one on each GPU.
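In case a starting point helps, here is a rough Python sketch of that approach. The `-benchmark`, `-numbodies`, and `-device` flags are the nbody sample's own; the path to `nbody.exe` is an assumption and needs to match wherever you built the CUDA samples.

```python
# Hedged sketch: run the CUDA "nbody" sample in benchmark mode, one instance
# per M60 GPU core. The flags come from the sample itself; NBODY_EXE is an
# assumed install path -- adjust for your setup.
import subprocess

NBODY_EXE = r"C:\cuda-samples\bin\nbody.exe"  # assumption, not a standard path

def nbody_command(device: int, numbodies: int) -> list:
    """argv for one non-interactive benchmark run pinned to one GPU index."""
    return [
        NBODY_EXE,
        "-benchmark",               # run headless and report GFLOP/s
        f"-numbodies={numbodies}",  # more bodies -> more work per kernel
        f"-device={device}",        # GPU index as shown by nvidia-smi
    ]

def stress_both_gpus(numbodies: int = 256000) -> None:
    """Launch one nbody instance on each of the two M60 cores, wait for both."""
    procs = [subprocess.Popen(nbody_command(d, numbodies)) for d in (0, 1)]
    for p in procs:
        p.wait()
```

Calling `stress_both_gpus()` launches both instances in parallel; increase `numbodies` while watching nvidia-smi until the sustained draw stops climbing.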
There is no particular reason to expect the VBIOS version to be identical between the two GPUs. The M60 is not exactly the same as two GPUs plugged into two slots, and it's quite possible that it is designed to have a different VBIOS version for the “first” GPU of the pair compared with the “second”.
NVIDIA doesn’t generally provide VBIOS field-upgrade tools across the board for every GPU; they are provided in special situations. There is no public repository provided by NVIDIA where you can go and get the “latest” VBIOS version for every GPU, and it's not recommended in the general case to attempt to upgrade your VBIOS unless you have been given specific instructions to do so by an OEM. In that case, they will give you the tool and the VBIOS image to flash.
Thank you for the detailed information, I appreciate it.
About the first answer, could you share how to run the nbody sample code in benchmark mode? I am not a programmer; if you can help, could you guide me step by step to run it?
I have not used an M60, but for stress testing (thermals / power) I like to use the Folding @ Home client, which tends to max out GPUs pretty reliably. The client is officially for Windows 10 / 11, but I assume it would also work with Windows Server 2022. I don’t know that for sure.
Total time for download and installation is just a few minutes. You may need administrative privileges to install. On my Windows workstation I gave my user account higher than default privileges, which makes it easy to miss where elevated privileges are required.
FurMark used to be a good “max power” program for GPUs in the past, but stopped fulfilling that role more than ten years ago.
Thanks for your suggestion. I tried Folding@Home and it drove my M60 to about 250~270 W power consumption; even though it cannot reach 300 W, I think that should be good enough.
Check the current Power Limit setting in the output of nvidia-smi to make sure it is not set lower than the maximum possible power limit.
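If it helps, here is a small, hedged Python sketch that scans the text of `nvidia-smi -q -d POWER` and flags any GPU whose enforced limit sits below the board maximum. The regular expressions are an assumption about the `-q` report's field names and “value W” formatting.

```python
# Hedged sketch: parse "nvidia-smi -q -d POWER" output and report GPUs whose
# Current Power Limit is below their Max Power Limit.
import re
import subprocess

def capped_gpus(report: str) -> list:
    """Return (bus_id, current_W, max_W) for each GPU capped below its max."""
    capped = []
    # Each per-GPU section begins with a line like "GPU 00000000:06:00.0";
    # the lookahead keeps headings such as "GPU Power Readings" from matching.
    for section in re.split(r"(?m)^GPU (?=[0-9A-Fa-f]{8}:)", report)[1:]:
        bus_id = section.split("\n", 1)[0].strip()
        cur = re.search(r"Current Power Limit\s*:\s*([\d.]+) W", section)
        mx = re.search(r"Max Power Limit\s*:\s*([\d.]+) W", section)
        if cur and mx and float(cur.group(1)) < float(mx.group(1)):
            capped.append((bus_id, float(cur.group(1)), float(mx.group(1))))
    return capped

def check_live() -> list:
    """Run nvidia-smi and report capped GPUs (requires the NVIDIA driver)."""
    out = subprocess.run(["nvidia-smi", "-q", "-d", "POWER"],
                         capture_output=True, text=True).stdout
    return capped_gpus(out)
```

`check_live()` prints nothing on its own; an empty list from it means every GPU is already at its maximum allowed limit.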
(1) GPU-Z and nvidia-smi both retrieve their GPU data from NVIDIA's management library NVML, so it is not necessary to run both.
(2) The sampling speed for GPU-Z is configurable, and a somewhat higher resolution provides less smoothing and better visibility of power peaks. I would try 0.5 sec sampling. I think the lowest granularity it offers is 0.1 sec, but that usually starts to generate too much overhead, actually slowing down Folding@Home.
(3) GPU-Z allows for easy display of the maximum of each metric over the monitored time period. There may be power peaks not easily visible from the graph.
(4) I looked up the M60 and it is a very low-end GPU by today's standards. That could severely limit the choice of work units Folding @ Home has available for this GPU, and may also mean that none of the available work units are well-tuned for this CC 5.2 hardware.
I checked, and no power limit has been set:
C:\Users\Administrator>nvidia-smi -q -d POWER
==============NVSMI LOG==============
Timestamp : Wed Oct 30 13:45:48 2024
Driver Version : 566.03
CUDA Version : 12.7
Attached GPUs : 2
GPU 00000000:06:00.0
GPU Power Readings
Power Draw : 15.93 W
Current Power Limit : 150.00 W
Requested Power Limit : 150.00 W
Default Power Limit : 150.00 W
Min Power Limit : 112.50 W
Max Power Limit : 162.00 W
Power Samples
Duration : 35.41 sec
Number of Samples : 119
Max : 19.82 W
Min : 14.23 W
Avg : 15.67 W
GPU Memory Power Readings
Power Draw : N/A
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
GPU 00000000:07:00.0
GPU Power Readings
Power Draw : 13.98 W
Current Power Limit : 150.00 W
Requested Power Limit : 150.00 W
Default Power Limit : 150.00 W
Min Power Limit : 112.50 W
Max Power Limit : 162.00 W
Power Samples
Duration : 35.36 sec
Number of Samples : 119
Max : 14.96 W
Min : 13.86 W
Avg : 14.47 W
GPU Memory Power Readings
Power Draw : N/A
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
C:\Users\Administrator>
Yes, the M60 is an old product for NVIDIA, but the purpose of generating full power consumption on the M60 is only to verify system stability, especially thermals and power delivery.
FWIW, the power limit clearly is not configured as the maximum possible. The maximum possible limit is 162W, but the currently enforced limit is 150W. If you want, you can use nvidia-smi to raise the power limit to the maximum of 162W that is allowed.
Obviously, since the power draw with Folding @ Home only ranges up to about 132W in your video, raising the power limit will not have any effect when running with the currently available work units. But in the future other F@H work units may become available that lead to a higher power draw.
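If you do decide to raise it, a hedged sketch along these lines could apply the maximum to both cores. The 112.5 W / 162 W bounds are the values from your `-q -d POWER` output above; `nvidia-smi -pl` needs an elevated (Administrator) prompt and a board that supports software power capping.

```python
# Hedged sketch: raise the enforced limit to the board maximum on both M60
# cores with "nvidia-smi -i <index> -pl <watts>". The min/max bounds below
# are the values reported by "nvidia-smi -q -d POWER" earlier in the thread.
import subprocess

MIN_LIMIT_W = 112.5  # "Min Power Limit" from the -q report
MAX_LIMIT_W = 162.0  # "Max Power Limit" from the -q report

def clamp_limit(watts: float) -> float:
    """nvidia-smi rejects limits outside [min, max], so clamp first."""
    return max(MIN_LIMIT_W, min(MAX_LIMIT_W, watts))

def set_power_limit(gpu_index: int, watts: float) -> None:
    """Apply the (clamped) limit to one GPU; needs an Administrator prompt."""
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(clamp_limit(watts))],
        check=True,
    )

def raise_both_to_max() -> None:
    """One call per M60 GPU core (indices 0 and 1 on this system)."""
    for idx in (0, 1):
        set_power_limit(idx, MAX_LIMIT_W)
```

Note the new limit does not persist across reboots unless the driver is running in persistence mode or the command is reapplied at startup.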
Got it. I can try changing the current power limit to 162 W and running it again, but I believe that will not make a difference from the last test since, as you said, Folding @ Home only ranges up to about 132 W.
The power draw of Folding @ Home depends on the type of work unit being used. The project ID in the F@H dashboard identifies the work unit type. The performance characteristics of F@H work units differ widely, as they are provided by different programmers working on different kinds of simulations. Some are more compute intensive, others more memory intensive.
My memory is hazy, but I seem to recall that there used to be a way to request only work units from particular projects. But the new streamlined web-based interface seems to have removed this choice (or maybe I simply have not found yet where relevant configuration knobs are hidden).
If I come across that information, I will make a note here.
BTW, if you would like to support the Folding @ Home project, consider joining team whoopass (team ID: 131015), which is the F@H team of the original CUDA engineering team at NVIDIA.
I started contributing on June 15, 2008 with one of the first GT200 available internally to NVIDIA engineers. At that time we had permission from management to use the company’s equipment for this to show off the power of GPU computing, then still in its infancy. I am still a regular contributor, currently trying to snag the number 1 position on the team.
The “cause” selector restricts the client to consider only work units from projects that are tagged with that cause. During the pandemic “covid” was a very popular cause, with half a dozen projects tagged with that cause. Long term favorite is probably “cancer”, which comprises multiple projects at any given time.
I initially started out supporting the “cancer” cause, but for years now I have been supporting all causes.
I seem to recall that selecting specific projects, or excluding certain projects, was supported because some projects would at times go haywire, causing all work units to fail and clients to idle unintentionally, and also to allow testing of new work units for specific projects. The feature may have been removed as F@H became more stable over time.