Hi, we have moved our decoding product from T1000 graphics cards to A1000 graphics cards for decoding high-density H.264/H.265 video. The new A1000 has two decoder engines but the same 8 GB of memory. We have observed a strange issue: when decoding 27 streams of 1080p@30 FPS (9 per output across 3 outputs), GPU utilization is 36% and decode is 54%, see attached image. After about 10 seconds of everything being OK, the decode engine ramps to 96% and the GPU to 100%… We saw similar behaviour on the T1000 and discovered that selecting ‘Performance’ over ‘Quality’ in the 3D settings fixed the issue; this is no longer the case. We have also observed a second Windows image where the problem does not exist (same hardware). Could an app on the second image be changing some performance settings? Any tests or tools you could point us at would be appreciated. Driver version 31.0.15.5222, April 2024.
Hi @BBVDev, could you run the following command in a terminal, in parallel with your application:
nvidia-smi dmon -s puct
Generally, it gives a better report of GPU utilization than the Task Manager.
Please consider using NVIDIA Nsight Systems to profile your application and find possible bottlenecks.
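If it is useful to log the same counters from inside the application, a minimal NVML polling sketch could look roughly like this (it assumes GPU index 0 and a one-second poll interval; adapt the index and loop for a multi-GPU system):
```
// Minimal NVML polling sketch: prints GPU/memory/decoder utilization plus the
// graphics and memory clocks once per second. Assumes GPU index 0.
#include <nvml.h>
#include <cstdio>
#include <chrono>
#include <thread>

int main() {
    if (nvmlInit_v2() != NVML_SUCCESS) return 1;

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) != NVML_SUCCESS) { nvmlShutdown(); return 1; }

    for (int i = 0; i < 60; ++i) {                  // ~1 minute of samples
        nvmlUtilization_t util{};                   // .gpu and .memory, in percent
        unsigned int dec = 0, samplingUs = 0;
        unsigned int gclk = 0, mclk = 0;

        nvmlDeviceGetUtilizationRates(dev, &util);
        nvmlDeviceGetDecoderUtilization(dev, &dec, &samplingUs);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_GRAPHICS, &gclk);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_MEM, &mclk);

        std::printf("gpu %3u%%  mem %3u%%  dec %3u%%  gclk %4u MHz  mclk %4u MHz\n",
                    util.gpu, util.memory, dec, gclk, mclk);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    nvmlShutdown();
    return 0;
}
```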
Best regards,
Diego
Hi Diego,
Thank you for your support.
Attached is the output from `nvidia-smi dmon -s puct`.
We have 3 cards in the system; two show the issue (GPU 0 and GPU 2), while GPU 1 is working as expected.
What is pclk?
It seems to be higher on the working GPU 1.
We already set NVML_CLOCK_SM and NVML_CLOCK_MEM.
Can we set ‘pclk’ programmatically with one of the following clock types?
```
typedef enum nvmlClockType_enum
{
    NVML_CLOCK_GRAPHICS = 0,  //!< Graphics clock domain
    NVML_CLOCK_SM       = 1,  //!< SM clock domain
    NVML_CLOCK_MEM      = 2,  //!< Memory clock domain
    NVML_CLOCK_VIDEO    = 3,  //!< Video encoder/decoder clock domain

    // Keep this last
    NVML_CLOCK_COUNT          //!< Count of clock types
} nvmlClockType_t;
```
This is only seen on the A1000, NOT the T1000…
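As a way to compare the cards, a small sketch along these lines (assuming an existing, valid nvmlDevice_t handle `dev`) could dump the current and maximum clock for each NVML clock domain, which should show which domain the dmon `pclk` column is tracking:
```
// Sketch: dump current vs. maximum clocks for each NVML clock domain on one GPU.
// Assumes NVML is already initialised and `dev` is a valid nvmlDevice_t handle.
#include <nvml.h>
#include <cstdio>

void dumpClocks(nvmlDevice_t dev) {
    static const struct { nvmlClockType_t type; const char *name; } domains[] = {
        { NVML_CLOCK_GRAPHICS, "graphics"      },
        { NVML_CLOCK_SM,       "SM"            },
        { NVML_CLOCK_MEM,      "memory"        },
        { NVML_CLOCK_VIDEO,    "video enc/dec" },
    };

    for (const auto &d : domains) {
        unsigned int cur = 0, max = 0;
        if (nvmlDeviceGetClockInfo(dev, d.type, &cur) == NVML_SUCCESS &&
            nvmlDeviceGetMaxClockInfo(dev, d.type, &max) == NVML_SUCCESS) {
            std::printf("%-14s %4u / %4u MHz\n", d.name, cur, max);
        }
    }
}
```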
Thanks.
Update,
Using
```
nvidia-smi -i 0 -lgc 1852,1852
nvidia-smi -i 1 -lgc 1852,1852
nvidia-smi -i 2 -lgc 1852,1852
```
fixed the issue.
Q: How do we do this programmatically? Is there an equivalent NVML call? Query the max and set it?
Thanks.
OK, as we are pushed for release, I’ve tested the following; however, I’m unsure if this is the best route:
```
result = nvmlDeviceGetMaxClockInfo(nvmlDeviceId, NVML_CLOCK_GRAPHICS, &maxGPUclock);
if (NVML_SUCCESS == result)
    result = nvmlDeviceSetGpuLockedClocks(nvmlDeviceId, maxGPUclock, maxGPUclock);
```
However, the clocks now remain at this max pair 100% of the time with our application, even for low-load use cases where passing a min-to-max range to the API call would make more sense. That range does not work, though; both values need to be set to the max for decoding to behave as required.
Let me know any pointers, cheers.
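One possibility we could explore (untested here) is to hold the lock only while the heavy decode session is active and release it afterwards with nvmlDeviceResetGpuLockedClocks; a minimal sketch, assuming a valid nvmlDevice_t handle `dev`:
```
// Sketch: lock the graphics clock to its maximum only for the heavy decode
// session, then release the lock so the driver manages clocks again.
// Assumes NVML is initialised and `dev` is a valid nvmlDevice_t handle;
// the set/reset calls may require elevated privileges.
#include <nvml.h>

nvmlReturn_t lockClocksForDecode(nvmlDevice_t dev) {
    unsigned int maxGpuClock = 0;
    nvmlReturn_t result = nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_GRAPHICS, &maxGpuClock);
    if (result != NVML_SUCCESS) return result;
    // Pin min == max, as found above; a wider min/max range did not hold the clock up.
    return nvmlDeviceSetGpuLockedClocks(dev, maxGpuClock, maxGpuClock);
}

nvmlReturn_t unlockClocksAfterDecode(nvmlDevice_t dev) {
    // Return clock management to the driver once the heavy decode load is gone.
    return nvmlDeviceResetGpuLockedClocks(dev);
}
```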
Hi @BBVDev,
Could you clarify exactly what the issue is? If the GPUs are running at 100% utilization at a lower clock rate, does this actually affect your application’s decoding throughput? If not, I would recommend not changing the clock frequencies; running at the lower clock is preferable, and holding the GPU at its maximum clock frequency for long periods is not recommended.
If you must change the clocks for some reason, please consider using nvmlDeviceSetApplicationsClocks.
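In case it helps, a rough sketch of selecting a supported memory/graphics clock pair for nvmlDeviceSetApplicationsClocks could look like this (only an illustration, assuming `dev` is a valid nvmlDevice_t handle and that the fixed-size arrays are large enough for the supported-clock lists):
```
// Sketch: pick the highest supported memory/graphics clock pair and apply it with
// nvmlDeviceSetApplicationsClocks (memory clock first, then graphics clock).
// Assumes `dev` is a valid nvmlDevice_t handle; setting clocks may need admin rights.
#include <nvml.h>
#include <algorithm>

nvmlReturn_t setMaxApplicationClocks(nvmlDevice_t dev)
{
    // Supported memory clocks for this board.
    unsigned int memClocks[128];
    unsigned int memCount = 128;
    nvmlReturn_t r = nvmlDeviceGetSupportedMemoryClocks(dev, &memCount, memClocks);
    if (r != NVML_SUCCESS || memCount == 0) return r;
    unsigned int maxMem = *std::max_element(memClocks, memClocks + memCount);

    // Graphics clocks that are valid in combination with that memory clock.
    unsigned int gfxClocks[256];
    unsigned int gfxCount = 256;
    r = nvmlDeviceGetSupportedGraphicsClocks(dev, maxMem, &gfxCount, gfxClocks);
    if (r != NVML_SUCCESS || gfxCount == 0) return r;
    unsigned int maxGfx = *std::max_element(gfxClocks, gfxClocks + gfxCount);

    return nvmlDeviceSetApplicationsClocks(dev, maxMem, maxGfx);
}
```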
Best regards,
Diego
Hi,
We already call nvmlDeviceSetApplicationsClocks and set both NVML_CLOCK_SM and NVML_CLOCK_MEM to their maximums; this was sufficient for the previous-generation T1000 when decoding.
The issue re-appeared with the A1000 graphics card, and setting these clocks was not enough for the decode performance to match the previous-generation T1000, which has only a single decode engine.
Through this support thread we managed to achieve the T1000 performance on the A1000 card using the additional calls nvmlDeviceGetMaxClockInfo and nvmlDeviceSetGpuLockedClocks, as detailed above.
The problem is then solved.
My question is whether you have any recommendations or advice on whether these two additional API calls are the correct solution?
Thank you in advance.