On Windows, getting list of graphics or compute processes can return invalid argument depending on launch

BlueGoliath · February 20, 2024, 2:45am

Hello?

BlueGoliath · February 24, 2024, 5:23am

Bump.

BlueGoliath · February 28, 2024, 9:45pm

Not fixed in 551.68.

pdinkar · March 1, 2024, 3:55pm

Hi @BlueGoliath There is an internal team tracking this issue. If you could provide a minimal reproducer in C/C++, that would accelerate the process.

BlueGoliath · March 2, 2024, 3:00am

I’m not sure how to make a minimal C/C++ example. I tried this:

#include <cstdio>
#include <cstdlib>
#include <string>

#include <windows.h>

#include "nvml.h"

using namespace std;

nvmlDevice_t device;

unsigned int* graphicsValue = (unsigned int*)malloc(sizeof(unsigned int));
unsigned int* smValue = (unsigned int*)malloc(sizeof(unsigned int));
unsigned int* memoryValue = (unsigned int*)malloc(sizeof(unsigned int));
unsigned int* videoValue = (unsigned int*)malloc(sizeof(unsigned int));

void doUpdate(nvmlClockType_t clockType, unsigned int* valuePointer)
{
    nvmlReturn_t returnValue;

    returnValue = nvmlDeviceGetMaxClockInfo(
            device,
            clockType,
            valuePointer);
    
    if(NVML_SUCCESS == returnValue && *valuePointer == 0)
    { 
        printf("BUG");
        fflush(stdout);
    }
    else if(NVML_SUCCESS != returnValue)
    {
        printf("FAIL");
        fflush(stdout);
    }
}

// CreateThread expects the LPTHREAD_START_ROUTINE signature: DWORD WINAPI fn(LPVOID).
DWORD WINAPI loopGraphicsUpdate(LPVOID ptr)
{
    while(true)
    {
        doUpdate(NVML_CLOCK_GRAPHICS, graphicsValue);
        Sleep(1);
    }
    
    return 0;
}

DWORD WINAPI loopSMUpdate(LPVOID ptr)
{
    while(true)
    {
        doUpdate(NVML_CLOCK_SM, smValue);
        Sleep(1);
    }
    
    return 0;
}

DWORD WINAPI loopMemoryUpdate(LPVOID ptr)
{
    while(true)
    {
        doUpdate(NVML_CLOCK_MEM, memoryValue);
        Sleep(1);
    }
    
    return 0;
}

DWORD WINAPI loopVideoUpdate(LPVOID ptr)
{
    while(true)
    {
        doUpdate(NVML_CLOCK_VIDEO, videoValue);
        Sleep(1);
    }
    
    return 0;
}

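int main(int argc, char** argv) {
    
    nvmlReturn_t returnValue = nvmlInit();
    if(NVML_SUCCESS != returnValue)
    {
        printf("nvmlInit failed: %s\n", nvmlErrorString(returnValue));
        return 1;
    }
    
    // Allocate room for one device handle, not for a pointer to a handle.
    nvmlDevice_t* devicePointer = (nvmlDevice_t*)malloc(sizeof(nvmlDevice_t));
    
    returnValue = nvmlDeviceGetHandleByIndex_v2(0, devicePointer);
    if(NVML_SUCCESS != returnValue)
    {
        printf("nvmlDeviceGetHandleByIndex_v2 failed: %s\n", nvmlErrorString(returnValue));
        return 1;
    }
    
    device = *devicePointer;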
    
    HANDLE graphicsThread = CreateThread( 
            NULL,
            0,
            loopGraphicsUpdate,
            NULL, 
            0,
            NULL);

    HANDLE smThread = CreateThread( 
            NULL,
            0,
            loopSMUpdate,
            NULL, 
            0,
            NULL);
    
    HANDLE memoryThread = CreateThread( 
            NULL,
            0,
            loopMemoryUpdate,
            NULL, 
            0,
            NULL);
    
    HANDLE videoThread = CreateThread( 
            NULL,
            0,
            loopVideoUpdate,
            NULL, 
            0,
            NULL);

    Sleep(600000000);
    
    return 0;
}

But it’s not the same as what my application does. There, the actual call to nvmlDeviceGetMaxClockInfo is made on an arbitrary worker thread of a ScheduledExecutorService. By default, the pool has as many threads as the machine has logical processors.
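A rough sketch of that scheduling pattern, in case it helps with reproduction. This is not the application code: SchedulerSketch and observeThreads are hypothetical names, and the job body only records the worker thread name where the real attribute's update() (and therefore the NVML call) would run.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the real setup; no NVML call is made here.
final class SchedulerSketch
{
    // Schedules `tasks` periodic jobs on a pool of `threads` workers, lets them
    // run for `millis` milliseconds, and returns the names of the worker
    // threads that executed at least one job.
    static Set<String> observeThreads(int threads, int tasks, long millis)
    {
        Set<String> seen = ConcurrentHashMap.newKeySet();
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(threads);

        for (int i = 0; i < tasks; i++)
        {
            // In the real application, this is where update() would be called.
            pool.scheduleAtFixedRate(
                    () -> seen.add(Thread.currentThread().getName()),
                    0, 1, TimeUnit.MILLISECONDS);
        }

        try
        {
            Thread.sleep(millis);
            pool.shutdownNow();
            pool.awaitTermination(1, TimeUnit.SECONDS);
        }
        catch (InterruptedException ex)
        {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }

        return seen;
    }

    public static void main(String[] args)
    {
        // Default pool size assumed from the description above: one thread per logical processor.
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(observeThreads(threads, 4, 200));
    }
}
```

Which worker picks up a given run is up to the pool, so the same clock type can be queried from a different thread on each run. The C++ version above instead pins each clock type to its own dedicated thread.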

For a more complete picture, here is the class file in full:

package com.bluegoliath.envious.nvml.local.attributes.clocks;

import java.util.List;
import java.util.Optional;
import com.bluegoliath.bindings.nvml.enums.nvmlClockType_t;
import com.bluegoliath.bindings.nvml.enums.nvmlReturn_t;
import com.bluegoliath.bindings.nvml.nvml_h;
import com.bluegoliath.envious.base.enums.Unit;
import com.bluegoliath.envious.nvml.local.internal.NVMLLocalGPUInternal;
import com.bluegoliath.crosspoint.values.NativeInteger;
import com.bluegoliath.envious.base.abstracts.internal.NVNumberAttributeGenericBase;
import com.bluegoliath.envious.nvml.local.internal.NVMLContextInternal;
import com.bluegoliath.oroc.interfaces.EnumStringProvider;

public class NVMLGPUClockMaxAttribute extends NVNumberAttributeGenericBase<Integer, nvmlReturn_t, NVMLLocalGPUInternal>
{
    private final NativeInteger valuePointer;
    
    private final nvmlClockType_t type;
    
    private final String description;
    
    private final EnumStringProvider<nvmlClockType_t> provider;
    
    public NVMLGPUClockMaxAttribute(NVMLLocalGPUInternal gpu, nvmlClockType_t type, String description, EnumStringProvider<nvmlClockType_t> provider)
    {
        super(gpu, "Clock Max", Unit.MEGAHERTZ);
        
        this.valuePointer = gpu.getAccountingAllocator().newNativeValue(NativeInteger.METADATA);
        
        this.type = type;
        this.description = description;
        this.provider = provider;
    }
    
    @Override
    public nvmlReturn_t update()
    {
        long startTime = System.currentTimeMillis();
        nvmlReturn_t returnValue = null;
        
        try
        {
            returnValue = nvml_h.INSTANCE.nvmlDeviceGetMaxClockInfo(
                    super.getNVDevice().get().getNativePointer(),
                    this.type,
                    this.valuePointer);
        }
        catch (Throwable ex)
        {
            ex.printStackTrace();
        }
        
        super.finishUpdate(returnValue, this.valuePointer.get(), System.currentTimeMillis() - startTime);
        
        return returnValue;
    }
    
    @Override
    public nvmlReturn_t getSuccessReturnValue()
    {
        return nvmlReturn_t.NVML_SUCCESS;
    }
    
    @Override
    public String getReturnString(nvmlReturn_t value)
    {
        return NVMLContextInternal.toString(value);
    }
    
    @Override
    public Optional<String> getDescription()
    {
        return Optional.of(this.description);
    }
    
    @Override
    public Optional<String> getContextualString(int index)
    {
        switch(index)
        {
            case 0:
                return Optional.of(this.getNVDevice().get().toString());
            case 1:
                return Optional.of(this.provider.getResultString(this.type));
            default:
                return Optional.empty();
        }
    }
    
    @Override
    public List<nvmlReturn_t> getReturnValues()
    {
        return List.of(
                nvmlReturn_t.NVML_SUCCESS,
                nvmlReturn_t.NVML_ERROR_UNINITALIZED,
                nvmlReturn_t.NVML_ERROR_INVALID_ARGUMENT,
                nvmlReturn_t.NVML_ERROR_NOT_SUPPORTED,
                nvmlReturn_t.NVML_ERROR_GPU_IS_LOST,
                nvmlReturn_t.NVML_ERROR_UNKNOWN);
    }
}

getAccountingAllocator just makes calls to the platform’s malloc stdlib function and keeps track of allocations.

I don’t really know what else to provide. Like I said, this max clock issue and the process function issue DO NOT happen on Linux.

BlueGoliath · March 2, 2024, 3:31am

Also, re: the PCIe function issues, since they seem sorta related, you can see a bug report here:

github.com/BlueGoliath/Envious-FX

java.util.NoSuchElementException: No value present

opened 10:18AM - 12 Oct 22 UTC

closed 02:43PM - 13 Oct 22 UTC

ennerf

I downloaded the release version, ran `./run` from powershell in the bin directo…ry, and got the following error: **Output** ``` WARNING: Unknown module: org.goliath.bindings.gamemode specified to --enable-native-access WARNING: Unknown module: org.goliath.bindings.nvxctrl specified to --enable-native-access WARNING: Unknown module: org.goliath.bindings.x specified to --enable-native-access Oct 12, 2022 12:07:52 PM org.goliath.crosspoint.interfaces.NativeLibrary getSymbolOrStub WARNING: generating stub for missing symbol nvmlDeviceGetTargetFanSpeed Oct 12, 2022 12:07:52 PM org.goliath.crosspoint.interfaces.NativeLibrary getSymbolOrStub WARNING: generating stub for missing symbol nvmlDeviceGetGpcClkMinMaxVfOffset Oct 12, 2022 12:07:52 PM org.goliath.crosspoint.interfaces.NativeLibrary getSymbolOrStub WARNING: generating stub for missing symbol nvmlDeviceGetMemClkVfOffset Oct 12, 2022 12:07:52 PM org.goliath.crosspoint.interfaces.NativeLibrary getSymbolOrStub WARNING: generating stub for missing symbol nvmlDeviceGetMemClkMinMaxVfOffset Oct 12, 2022 12:07:52 PM org.goliath.crosspoint.interfaces.NativeLibrary getSymbolOrStub WARNING: generating stub for missing symbol nvmlDeviceSetMemClkVfOffset Oct 12, 2022 12:07:52 PM org.goliath.crosspoint.interfaces.NativeLibrary getSymbolOrStub WARNING: generating stub for missing symbol nvmlDeviceGetPowerMode Oct 12, 2022 12:07:52 PM org.goliath.crosspoint.interfaces.NativeLibrary getSymbolOrStub WARNING: generating stub for missing symbol nvmlDeviceGetSupportedPowerModes Oct 12, 2022 12:07:52 PM org.goliath.crosspoint.interfaces.NativeLibrary getSymbolOrStub WARNING: generating stub for missing symbol nvmlDeviceGetPowerMode Exception in Application start method java.lang.reflect.InvocationTargetException at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at 
javafx.graphics@19/com.sun.javafx.application.LauncherImpl.launchApplicationWithArgs(Unknown Source) at javafx.graphics@19/com.sun.javafx.application.LauncherImpl.launchApplication(Unknown Source) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at java.base/sun.launcher.LauncherHelper$FXHelper.main(Unknown Source) Caused by: java.lang.RuntimeException: Exception in Application start method at javafx.graphics@19/com.sun.javafx.application.LauncherImpl.launchApplication1(Unknown Source) at javafx.graphics@19/com.sun.javafx.application.LauncherImpl.lambda$launchApplication$2(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: java.util.NoSuchElementException: No value present at java.base/java.util.Optional.get(Unknown Source) at org.goliath.envious.fx.app@1/org.goliath.envious.fx.app.monitoring.MonitoringContentPane.<init>(Unknown Source) at org.goliath.envious.fx.app@1/org.goliath.envious.fx.app.content.monitoring.PCIeContentItem.<init>(Unknown Source) at org.goliath.envious.fx.app@1/org.goliath.envious.fx.app.content.MonitoringContentItem.<init>(Unknown Source) at org.goliath.envious.fx.app@1/org.goliath.envious.fx.app.MainAppContent.<init>(Unknown Source) at org.goliath.envious.fx.app@1/org.goliath.envious.fx.app.AppRoot.<init>(Unknown Source) at org.goliath.envious.fx.app@1/org.goliath.envious.fx.app.GoliathEnviousFX.start(Unknown Source) at javafx.graphics@19/com.sun.javafx.application.LauncherImpl.lambda$launchApplication1$9(Unknown Source) at javafx.graphics@19/com.sun.javafx.application.PlatformImpl.lambda$runAndWait$12(Unknown Source) at javafx.graphics@19/com.sun.javafx.application.PlatformImpl.lambda$runLater$10(Unknown Source) at java.base/java.security.AccessController.doPrivileged(Unknown Source) at javafx.graphics@19/com.sun.javafx.application.PlatformImpl.lambda$runLater$11(Unknown Source) at 
javafx.graphics@19/com.sun.glass.ui.InvokeLaterDispatcher$Future.run(Unknown Source) at javafx.graphics@19/com.sun.glass.ui.win.WinApplication._runLoop(Native Method) at javafx.graphics@19/com.sun.glass.ui.win.WinApplication.lambda$runLoop$3(Unknown Source) ... 1 more Exception running application org.goliath.envious.fx.app.GoliathEnviousFX ``` **NVIDIA System Information** ``` [Display] Operating System: Windows 10 Enterprise, 64-bit DirectX version: 12.0 GPU processor: NVIDIA GeForce RTX 2060 Driver version: 516.94 Driver Type: DCH Direct3D feature level: 12_1 CUDA Cores: 1920 Core clock: 1755 MHz Memory data rate: 14.00 Gbps Memory interface: 192-bit Memory bandwidth: 336.05 GB/s Total available graphics memory: 22484 MB Dedicated video memory: 6144 MB GDDR6 System video memory: 0 MB Shared system memory: 16340 MB Video BIOS version: 90.06.3F.00.E8 IRQ: Not used Bus: PCI Express x16 Gen3 Device Id: 10DE 1F08 3FC11458 Part Number: G161 0042 [Components] nvui.dll 8.17.15.1694 NVIDIA User Experience Driver Component nvxdplcy.dll 8.17.15.1694 NVIDIA User Experience Driver Component nvxdbat.dll 8.17.15.1694 NVIDIA User Experience Driver Component nvxdapix.dll 8.17.15.1694 NVIDIA User Experience Driver Component NVCPL.DLL 8.17.15.1694 NVIDIA User Experience Driver Component nvCplUIR.dll 8.1.940.0 NVIDIA Control Panel nvCplUI.exe 8.1.940.0 NVIDIA Control Panel nvWSSR.dll 31.0.15.1694 NVIDIA Workstation Server nvWSS.dll 31.0.15.1694 NVIDIA Workstation Server nvViTvSR.dll 31.0.15.1694 NVIDIA Video Server nvViTvS.dll 31.0.15.1694 NVIDIA Video Server nvLicensingS.dll 6.14.15.1694 NVIDIA Licensing Server nvDevToolSR.dll 31.0.15.1694 NVIDIA Licensing Server nvDevToolS.dll 31.0.15.1694 NVIDIA 3D Settings Server nvDispSR.dll 31.0.15.1694 NVIDIA Display Server nvDispS.dll 31.0.15.1694 NVIDIA Display Server PhysX 09.21.0713 NVIDIA PhysX NVCUDA64.DLL 31.0.15.1694 NVIDIA CUDA 11.7.101 driver nvGameSR.dll 31.0.15.1694 NVIDIA 3D Settings Server nvGameS.dll 31.0.15.1694 NVIDIA 3D 
Settings Server ```

I fixed this by single-threading all PCIe calls. PCIe gen/width/speed are updated in this class by the mentioned executor:

package com.bluegoliath.envious.fx.platform.internal;

import com.bluegoliath.bindings.nvml.enums.nvmlReturn_t;
import com.bluegoliath.oroc.interfaces.NumberReadable;

public class PCIeHotfix implements Runnable
{
    private final NumberReadable<Integer, nvmlReturn_t> gen;
    private final NumberReadable<Integer, nvmlReturn_t> width;
    private final NumberReadable<Integer, nvmlReturn_t> speed;
    
    public PCIeHotfix(NumberReadable<Integer, nvmlReturn_t> gen, NumberReadable<Integer, nvmlReturn_t> width, NumberReadable<Integer, nvmlReturn_t> speed)
    {
        this.gen = gen;
        this.width = width;
        this.speed = speed;
    }
    
    @Override
    public void run()
    {
        this.gen.update();
        this.width.update();
        this.speed.update();
    }
}
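For illustration, here is a minimal sketch of why grouping the updates into one Runnable serializes them even on a multi-threaded pool. It assumes the job is scheduled periodically (the scheduling call is not shown in this thread). GroupedUpdateSketch and maxConcurrentUpdates are hypothetical names, and the update body is a stand-in counter rather than an NVML call.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; the update below only counts overlap and calls no native code.
final class GroupedUpdateSketch
{
    // Returns the highest number of updates that were ever in flight at once.
    static int maxConcurrentUpdates(int threads, long millis)
    {
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();

        // Stand-in for one NVML-backed update() call.
        Runnable update = () ->
        {
            int now = inFlight.incrementAndGet();
            maxSeen.accumulateAndGet(now, Math::max);
            inFlight.decrementAndGet();
        };

        // Same shape as PCIeHotfix.run(): the three updates run back to back.
        Runnable grouped = () ->
        {
            update.run(); // gen
            update.run(); // width
            update.run(); // speed
        };

        ScheduledExecutorService pool = Executors.newScheduledThreadPool(threads);

        // Successive runs of one periodic task never overlap, even on a multi-threaded pool.
        pool.scheduleAtFixedRate(grouped, 0, 1, TimeUnit.MILLISECONDS);

        try
        {
            Thread.sleep(millis);
            pool.shutdownNow();
            pool.awaitTermination(1, TimeUnit.SECONDS);
        }
        catch (InterruptedException ex)
        {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }

        return maxSeen.get();
    }

    public static void main(String[] args)
    {
        System.out.println(maxConcurrentUpdates(Runtime.getRuntime().availableProcessors(), 200));
    }
}
```

The run may still land on a different worker thread each time, but the three calls can no longer execute concurrently with each other, which is the property the hotfix relies on.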

BlueGoliath · March 12, 2024, 11:39pm

Anything?

BlueGoliath · March 21, 2024, 7:05pm

Newest driver still has issues.

BlueGoliath · April 4, 2024, 4:55pm

Still not fixed with newest driver.

BlueGoliath · April 16, 2024, 10:19pm

552.22 not fixed.

faz · April 24, 2024, 10:10am

Lol. So erm could you explain why you expect complete randos to do stuff when Nvidia team is trying their hardest to solve the issues? Seems you don’t have voltage control by the way, nice one.

@pdinkar, is there any way to set a max voltage or modify the frequency at each voltage? I see no way to do so in NVML currently. Having maximum voltage control would allow us to undervolt efficiently under Linux as well.

pdinkar · July 5, 2024, 10:13pm

Hello @BlueGoliath, we have tried to reproduce the issue with your reproducer code on multiple boards, running in a multi-threaded environment for an extended period of time. We could not reproduce the issue with the NVML API calls. Regardless, we are modifying NVML to return an error when the max clock value is 0.

@faz Nvidia currently does not offer voltage control.

BlueGoliath · July 6, 2024, 1:47am

Thanks for looking into it, I guess. I’ve just disabled multi-threading by default, since on Linux I’m getting the opposite problem, where functions appear to share some kind of lock. It wasn’t like that on Linux a few years ago. Maybe this is a bug in the JDK.
