On Windows, getting the list of graphics or compute processes can return invalid argument depending on the launch

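On Windows, the functions that list graphics and compute processes sometimes return NVML_ERROR_INVALID_ARGUMENT instead of NVML_ERROR_INSUFFICIENT_SIZE, which is what they are supposed to return. With zero code changes, the behavior differs from one application launch to the next, and when it occurs it persists for the duration of that application run. This does not happen on Linux.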
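For anyone trying to reproduce this, here is a minimal sketch of the two-call sizing pattern that the process list functions are documented to support: a first call with a count of 0 should return NVML_ERROR_INSUFFICIENT_SIZE (or NVML_SUCCESS if nothing is running) and report the required size, and a second call fills the buffer. `ProcessListQuery` and `NativeProcessCall` are hypothetical stand-ins, not real NVML bindings and not the code used in Envious FX; only the three status values are taken from nvml.h.

```java
import java.util.Arrays;

/** Hypothetical sketch of the two-call sizing pattern; not a real NVML binding. */
final class ProcessListQuery {
    // Numeric values of the nvmlReturn_t codes involved, as defined in nvml.h.
    static final int NVML_SUCCESS = 0;
    static final int NVML_ERROR_INVALID_ARGUMENT = 2;
    static final int NVML_ERROR_INSUFFICIENT_SIZE = 7;

    /** Assumed stand-in for a binding to one of the running-process functions. */
    interface NativeProcessCall {
        // count[0] is in/out like infoCount; pids may be null on the sizing call.
        int call(int[] count, long[] pids);
    }

    /** Returns the reported process IDs, or throws if either call misbehaves. */
    static long[] query(NativeProcessCall nativeCall) {
        int[] count = {0};

        // Sizing call: a count of 0 asks only for the number of processes.
        int status = nativeCall.call(count, null);
        if (status == NVML_SUCCESS) {
            return new long[0]; // no processes are running
        }
        if (status != NVML_ERROR_INSUFFICIENT_SIZE) {
            // The behavior reported in this thread lands here with status 2.
            throw new IllegalStateException("sizing call returned status " + status);
        }

        // Fill call: the buffer now has the size reported by the sizing call.
        long[] pids = new long[count[0]];
        status = nativeCall.call(count, pids);
        if (status != NVML_SUCCESS) {
            throw new IllegalStateException("fill call returned status " + status);
        }
        return Arrays.copyOf(pids, count[0]);
    }
}
```

The sketch ignores the case where the process list grows between the two calls; a real caller would probably loop on NVML_ERROR_INSUFFICIENT_SIZE.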

Please fix.

App that shows the problem: Release V1-Beta-1 · BlueGoliath/Envious-FX · GitHub

Can someone please look into this?

Same issue with the newest driver.

Can someone at Nvidia please fix this?

This issue is currently under investigation. We will respond once we have an update.

Thank you.

On the newest driver, it seems like yet another issue was introduced: getting max clocks via nvmlDeviceGetMaxClockInfo can sometimes return NVML_SUCCESS but write zero to the output pointer. This appears to be similar to another bug with the acoustic threshold from over a year ago.

Can someone at Nvidia please look at the Windows version of NVML? It’s full of so many bugs. Nearly every single PCIe function related to link generation, width, and speed has multi-threading-related bugs as well.

I know NVML isn’t a priority for anyone, but these bugs have existed for so long, and since you aren’t releasing the full version of NVAPI, anyone who wants this information has little choice but to use NVML.

Tried the newest driver, still not fixed:

Attribute: Memory Clock Max
Value: 0
Result: NVML_SUCCESS
Attribute: Video Clock Max
Value: 0
Result: NVML_SUCCESS
Attribute: Graphics Clock Max
Value: 0
Result: NVML_SUCCESS
Attribute: Memory Clock Max
Value: 0
Result: NVML_SUCCESS
Attribute: Graphics Clock Max
Value: 0
Result: NVML_SUCCESS
Attribute: Video Clock Max
Value: 0
Result: NVML_SUCCESS
Attribute: SM Clock Max
Value: 0
Result: NVML_SUCCESS
Attribute: SM Clock Max
Value: 0
Result: NVML_SUCCESS
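Until this is fixed, a caller can at least avoid trusting the rows above. The following is a rough sketch of a guard that treats NVML_SUCCESS with a value of 0 as a suspect reading and retries a few times. It assumes a max clock of 0 MHz is never legitimate. `MaxClockGuard` and `ClockCall` are hypothetical names for illustration and do not correspond to any real binding.

```java
import java.util.OptionalInt;

/** Hypothetical guard against "NVML_SUCCESS but value 0" readings. */
final class MaxClockGuard {
    static final int NVML_SUCCESS = 0; // value from nvml.h

    /** Assumed stand-in for a binding to nvmlDeviceGetMaxClockInfo. */
    interface ClockCall {
        // Writes the clock in MHz to out[0] and returns the status code.
        int call(int[] out);
    }

    /**
     * Returns the clock only if a call succeeds with a nonzero value.
     * Assumption: 0 MHz is never a valid max clock, so it is retried.
     */
    static OptionalInt read(ClockCall call, int maxAttempts) {
        int[] out = new int[1];
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            out[0] = 0; // reset so a stale value is never mistaken for a result
            int status = call.call(out);
            if (status != NVML_SUCCESS) {
                return OptionalInt.empty(); // a real error, do not retry
            }
            if (out[0] > 0) {
                return OptionalInt.of(out[0]);
            }
            // Success with 0: the suspect case from the log, try again.
        }
        return OptionalInt.empty();
    }
}
```

This only hides the symptom. It says nothing about why the zero is returned in the first place.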

Issues still exist on the newest driver.

Newest 546.17 driver still has issues.

How has this not been fixed yet?

Is this just not getting fixed?

Hi @BlueGoliath @TomNVIDIA , We need additional information to proceed.

  1. Are the issues with nvmlDeviceGetComputeRunningProcesses and nvmlDeviceGetGraphicsRunningProcesses APIs still being seen in the latest driver?
  2. For the issue with nvmlDeviceGetMaxClockInfo, what GPU is this being run on? What is the output from nvidia-smi -q?
  3. What is the Windows OS version in use?

  1. Yes, it happens on the 551.23 driver.

Edit: re: 1, maybe not. I was thinking of the clock max issue. That still happens. I’ll try to see if the process-related bug happens.

  2. GTX 1080. It’s partially broken (crashes doing anything 3D) though. If need be, I can throw in a 960 to verify it isn’t hardware-related.

  3. Windows 10 19045.

Code, if it helps any:

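@Override
public synchronized nvmlReturn_t update()
{
    long startTime = System.currentTimeMillis();
    nvmlReturn_t returnValue = null;

    try
    {
        // Native call: the max clock for this.type is written to valuePointer.
        returnValue = nvml_h.nvmlDeviceGetMaxClockInfo(
                super.getNVDevice().get().getNativePointer(),
                this.type,
                this.valuePointer);
    }
    catch (Throwable ex)
    {
        // returnValue stays null if the native call throws.
        ex.printStackTrace();
    }

    // A max clock of 0 is never expected, so flag it whatever the return code was.
    if(this.valuePointer.intValue() == 0)
    {
        System.out.println("BUG");
    }

    // Pass the return code, the value read, and the elapsed milliseconds to the base class.
    super.finishUpdate(returnValue, this.valuePointer.get(), System.currentTimeMillis() - startTime);

    return returnValue;
}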

I can’t get the process list bug to happen. Maybe it’s fixed or maybe I’m just getting lucky. Is there an internal bug report about this?

And are you able to reproduce the max clock bug?

Is this getting fixed?

Hi @BlueGoliath ,
For the process list API, glad to hear you do not have an issue with that any longer.

For the clock API, I was unable to reproduce the issue. The reproduction was attempted on a system with NVIDIA GeForce GTX 1080, with Windows 10 (19045), 551.23 driver.
To see the values reported by the driver, run nvidia-smi -q
In the command output, under subsection Clocks, you should see the values being reported.

Did you try to reproduce using a multi-threaded application? My app, Envious FX, is extremely multi-threaded and makes dozens of NVML calls concurrently every second. nvmlDeviceGetMaxClockInfo alone gets called multiple times a second. You would not likely see the issue with nvidia-smi.

Edit: to clarify, the bug is not “this function always returns garbage values” but “it sometimes returns garbage values”. I think it’s because of a multi-threading bug in NVML.
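To make "sometimes" measurable, here is a rough sketch of a stress harness: several threads start together, each calls a query in a loop, and the harness counts how many calls came back as 0. `ZeroValueHammer` is a hypothetical name, and the `IntSupplier` passed in is assumed to wrap whatever binding performs the real nvmlDeviceGetMaxClockInfo call. This is not how Envious FX is structured, just the smallest harness that exercises the same kind of concurrency.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.IntSupplier;

/** Hypothetical stress harness: counts zero results across concurrent callers. */
final class ZeroValueHammer {
    static long countZeros(IntSupplier query, int threads, int callsPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        AtomicLong zeros = new AtomicLong();

        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    start.await(); // hold every worker until all are queued
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < callsPerThread; i++) {
                    if (query.getAsInt() == 0) {
                        zeros.incrementAndGet();
                    }
                }
            });
        }

        start.countDown(); // release all workers at once
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return zeros.get();
    }
}
```

A result of 0 with one thread and a nonzero result with many threads would support the multi-threading explanation. A nonzero result with a single thread would point elsewhere.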

I don’t understand why it’s so hard to find and fix this. Does no one at Nvidia have a multi-threaded application that uses NVML?

This is 100% a multi-threading issue. In fact, it appears to be the same one that plagues PCIe generation, bus width, and speed current/max but instead of throwing an error it just sometimes returns 0.
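One way to test that claim from the application side is to funnel every NVML call through a single process-wide lock and see whether the zeros disappear. The sketch below assumes nothing about NVML itself. `NvmlGate` is a hypothetical helper, and the supplier passed to it is whatever native call the application already makes. NVML is documented as thread-safe, so this should not be needed. It is meant as a diagnostic, not a fix.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Hypothetical helper: serializes all wrapped calls behind one global lock. */
final class NvmlGate {
    // One lock for the whole process, shared by every monitor and every device.
    private static final ReentrantLock LOCK = new ReentrantLock();

    private NvmlGate() {
    }

    /** Runs the given call while holding the global lock and returns its result. */
    static <T> T call(Supplier<T> nativeCall) {
        LOCK.lock();
        try {
            return nativeCall.get();
        } finally {
            LOCK.unlock(); // always released, even if the call throws
        }
    }
}
```

A `synchronized` method on each monitor object only serializes calls on that one object, so two different monitors can still be inside NVML at the same time. A shared lock like this one removes that overlap. If the bad values stop with the gate in place and return without it, that is fairly strong evidence for a threading bug in the library.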

I’ve uploaded a special version of my app with Graphics/SM/Memory/Video max monitors here:

Would someone look into this?