How can I tell whether an EXCLUSIVE_PROCESS-mode GPU is "taken" or not?

Suppose I have set a GPU to have an EXCLUSIVE_PROCESS compute mode, using:

nvidia-smi -i 0 --compute-mode=EXCLUSIVE_PROCESS

I want to check, programmatically, whether any process has already “caught” that GPU (i.e. has created, and not yet destroyed, a context associated with the GPU).

Now, I could check by trying to create a context myself; but that means that during my check I am monopolizing the GPU, which I don’t want to do. Can I check this without contending to be the exclusively-accessing process myself?

There is no safe method to do so. After you check (using whatever method), another process could swoop in and “claim” the GPU, before you have actually gotten to claiming it yourself.

Furthermore, any such method probably isn’t sensible from a program using the CUDA runtime API anyway. Any usage of the CUDA runtime API will typically auto-create a CUDA context before you have done anything else, and that context creation will fail in compute-exclusive mode if another process is already using the GPU.
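
As a concrete illustration (not from the thread, just a sketch): the probe below forces context creation with cudaFree(0), which is exactly the “monopolize while checking” behavior the question wants to avoid. The precise error code returned when another process already holds the GPU (e.g. cudaErrorDevicesUnavailable) can vary with driver and CUDA version.

    // Sketch: probing an EXCLUSIVE_PROCESS GPU by attempting context creation.
    // Build e.g. with: nvcc -o probe probe.cu   (filename is arbitrary)
    #include <cuda_runtime_api.h>
    #include <stdio.h>

    int main(void)
    {
        cudaSetDevice(0);
        cudaError_t err = cudaFree(0);   // forces context creation on device 0
        if (err == cudaSuccess)
            printf("context created -- we are now the process monopolizing the GPU\n");
        else
            printf("context creation failed: %s\n", cudaGetErrorString(err));
        return 0;
    }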

You could probably use nvidia-smi, or an equivalent sequence of calls using NVML, to get an instantaneous check of whether another process is using that GPU, but again, that doesn’t create a reservation, and by the time you actually go to “claim” it, it may be too late anyway.
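
(For example, a non-reserving spot check from the command line could be something along these lines; it lists any compute processes currently on device 0:)

    nvidia-smi -i 0 --query-compute-apps=pid,process_name --format=csv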

The typically suggested method to share a resource like this in a sane way is to use a job scheduler like SLURM.

After you check (using whatever method), another process could swoop in and “claim” the GPU

I realize that a check will not be safe against race conditions. But suppose I don’t care about those. Example:

  • I want to print this information to the user.
  • I have a guarantee that no other process concurrent with me is monopolizing a GPU, but I don’t have this guarantee about the past.

You mentioned a sequence of NVML calls. What sequence? I tried nvmlDeviceGetComputeRunningProcesses(), but that doesn’t seem to do the trick.

The typically suggested method to share a resource like this in a sane way is to use a job scheduler like SLURM.

It’s not my call… I’m trying to help out in a case where EXCLUSIVE_PROCESS is the existing behavior of an existing app, and I want visibility into which process has monopolized which GPU.

Questions about NVML usage should be directed to the NVML forum. Rather than saying “that doesn’t seem to do the trick” you might be better off if you provide a fully worked example of what you tried.

Do as you wish, of course. Just making suggestions here. Also, regarding NVML, it does not necessarily support every possibility on every GPU type. For example, the support footprint on GeForce GPUs is notably less than for Datacenter GPUs. I don’t happen to know if that would be applicable here or not.

Oh, sorry, I didn’t notice there was a separate NVML forum. I’ll ask there and provide more information like you suggested.

On a “standard” CUDA linux install, there is an NVML example code given in /usr/local/cuda/nvml/example.

If I modify that code (CUDA 12.2) as follows:

        // This is a simple example on how you can modify GPU's state
        result = nvmlDeviceGetComputeMode(device, &compute_mode);
        if (NVML_ERROR_NOT_SUPPORTED == result)
            printf("\t This is not CUDA capable device\n");
        else if (NVML_SUCCESS != result)
        {
            printf("Failed to get compute mode for device %u: %s\n", i, nvmlErrorString(result));
            goto Error;
        }
        else
        {
#if 0
                // try to change compute mode
            printf("\t Changing device's compute mode from '%s' to '%s'\n",
                    convertToComputeModeString(compute_mode),
                    convertToComputeModeString(NVML_COMPUTEMODE_PROHIBITED));

            result = nvmlDeviceSetComputeMode(device, NVML_COMPUTEMODE_PROHIBITED);
            if (NVML_ERROR_NO_PERMISSION == result)
                printf("\t\t Need root privileges to do that: %s\n", nvmlErrorString(result));
            else if (NVML_ERROR_NOT_SUPPORTED == result)
                printf("\t\t Compute mode prohibited not supported. You might be running on\n"
                       "\t\t windows in WDDM driver model or on non-CUDA capable GPU\n");
            else if (NVML_SUCCESS != result)
            {
                printf("\t\t Failed to set compute mode for device %u: %s\n", i, nvmlErrorString(result));
                goto Error;
            }
            else
            {
                printf("\t Restoring device's compute mode back to '%s'\n",
                        convertToComputeModeString(compute_mode));
                result = nvmlDeviceSetComputeMode(device, compute_mode);
                if (NVML_SUCCESS != result)
                {
                    printf("\t\t Failed to restore compute mode for device %u: %s\n", i, nvmlErrorString(result));
                    goto Error;
                }
            }
#else
            unsigned int infoCount = 8;       // must match (or not exceed) the capacity of infos[] below
            nvmlProcessInfo_t infos[8];
            result = nvmlDeviceGetComputeRunningProcesses_v2(device, &infoCount, infos);
            if (NVML_SUCCESS != result) printf("get compute running processes returned: %d, %s\n", (int)result, nvmlErrorString(result));
            else printf("infoCount = %u\n", infoCount);
#endif
        }
    }

    result = nvmlShutdown();

and then use the supplied makefile to build it, I get an output like this on a machine with a single L4 GPU, when no compute process is running on that GPU:

# ./example
Found 1 device

Listing devices:
0. NVIDIA L4 [00000000:82:00.0]
infoCount = 0
All done.
Press ENTER to continue...

OTOH if I run a trivial compute process that does a cudaSetDevice(0) and then sleep() for a number of seconds, and concurrently run the same example, I get this:

# ./example
Found 1 device

Listing devices:
0. NVIDIA L4 [00000000:82:00.0]
infoCount = 1
All done.
Press ENTER to continue...

So the mechanism seems to work for me.
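
(The trivial compute process itself isn’t reproduced above; a sketch of what it might look like is below. With the CUDA 12.2 setup used here, cudaSetDevice(0) is enough to establish a context; on older CUDA versions an extra context-creating call such as cudaFree(0) might be needed.)

    // Sketch of the "trivial compute process": grab device 0, then hold it for a while.
    #include <cuda_runtime_api.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        cudaError_t err = cudaSetDevice(0);
        if (err != cudaSuccess)
        {
            printf("cudaSetDevice failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        sleep(30);   // keep the context alive so a concurrent NVML query can see this process
        return 0;
    }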

(For future readers who may find my usage of the _v2 variant of the API call a bit unusual: there was a kerfuffle recently with the development path of NVML, which I don’t wish to go into here. See here for some detail. I happened to be using a 535.86.10 driver.)

Thanks, Robert. I guess I must have gotten something wrong in my small test program.

And this increases my motivation to finally get around to covering NVML with my API wrappers.
