Using NVML API on HPC cluster without root access

Hi folks,

I am investigating how dynamically over/underclocking the GPU during execution of some HPC code affects the total energy consumption of a given process. Having achieved promising results locally (RTX A4000), where I am able to run the experimental code with sudo, I am looking to rerun the experiment on a remote node in an HPC cluster to test it on an A100. As this is a shared system, I cannot expect to be given sudo access for running large amounts of code.

According to the NVML API documentation, most of the relevant functions [1, 2] require “root/admin” permissions.

As I understand the documentation, the system administrator would have to run a program that calls nvmlDeviceSetAPIRestriction() to lift the restriction for non-root users, and that change would stay in effect until a system reboot resets everything.

I have written a barebones script, but I think I must be making a mistake in how I am using one of the functions: when run with sudo privileges, both locally and on the HPC node, it returns the same error, “NVML error at nvmlDeviceSetAPIRestriction: Not Supported”.

The local machine has 1x RTX A4000 and the remote node has 4x A100; they should all fully support the NVML feature set.

The script:

//compile with: nvcc modifyNVMLpermission.cu -o modifyNVMLpermission -lnvidia-ml
//run with: sudo ./modifyNVMLpermission

#include <stdio.h>
#include <nvml.h>

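int main(void){
	nvmlReturn_t nvml;
	nvmlDevice_t nvmlDevice;
	unsigned int deviceNum = 0;

	nvml = nvmlInit_v2();
	if (nvml != NVML_SUCCESS){
		printf("\nNVML error at nvmlInit_v2: %s\n", nvmlErrorString(nvml));
		return 1; // no later NVML call can work without a successful init
	}
	printf("\nnvmlInit_v2 success\n");

	nvml = nvmlDeviceGetHandleByIndex_v2(deviceNum, &nvmlDevice);
	if (nvml != NVML_SUCCESS){
		printf("\nNVML error at nvmlDeviceGetHandleByIndex_v2: %s\n", nvmlErrorString(nvml));
		nvmlShutdown();
		return 1; // the handle is invalid, so stop here
	}
	printf("\nnvmlDeviceGetHandleByIndex_v2 success\n");

	printf("\nSetting APIRestriction on device %u\n", deviceNum);
	// The last argument is the target restriction state: per the NVML docs,
	// NVML_FEATURE_ENABLED means root-only and NVML_FEATURE_DISABLED opens
	// the API to all users, which is what this script is meant to do.
	nvml = nvmlDeviceSetAPIRestriction(nvmlDevice, NVML_RESTRICTED_API_SET_APPLICATION_CLOCKS, NVML_FEATURE_DISABLED);
	if (nvml != NVML_SUCCESS){
		printf("\nNVML error at nvmlDeviceSetAPIRestriction: %s\n", nvmlErrorString(nvml));
	} else {
		printf("\nAPI Restriction set successfully\n");
	}

	// Release NVML resources and report failure through the exit code
	nvmlShutdown();
	return (nvml == NVML_SUCCESS) ? 0 : 1;
}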
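
Since the restriction state changed by nvmlDeviceSetAPIRestriction() can also be read back with nvmlDeviceGetAPIRestriction(), which as far as I can tell does not need root, a read-only check on each GPU might help narrow things down. The following is only a minimal sketch under some assumptions: it loads libnvidia-ml.so.1 at run time with dlopen() so it builds without nvml.h, the typedefs and numeric constants are hand-mirrored from the header (NVML_SUCCESS = 0, NVML_RESTRICTED_API_SET_APPLICATION_CLOCKS = 0, NVML_FEATURE_ENABLED = 1) and should be checked against the installed version, and the helper names restriction_label() and report_app_clock_restrictions() are hypothetical.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hand-mirrored NVML declarations (assumption: they match the installed nvml.h). */
typedef int nvmlReturn_t;      /* NVML_SUCCESS is 0 */
typedef void *nvmlDevice_t;    /* opaque device handle */

typedef nvmlReturn_t (*init_fn)(void);
typedef nvmlReturn_t (*shutdown_fn)(void);
typedef const char *(*errstr_fn)(nvmlReturn_t);
typedef nvmlReturn_t (*count_fn)(unsigned int *);
typedef nvmlReturn_t (*handle_fn)(unsigned int, nvmlDevice_t *);
typedef nvmlReturn_t (*getrestr_fn)(nvmlDevice_t, unsigned int, unsigned int *);

/* NVML_FEATURE_ENABLED (1) means root-only, NVML_FEATURE_DISABLED (0) means open. */
const char *restriction_label(unsigned int is_restricted) {
    return is_restricted ? "root-only" : "open to all users";
}

/* Prints the application-clocks restriction of every GPU.
 * Returns the number of GPUs found, or -1 if NVML could not be used. */
int report_app_clock_restrictions(void) {
    void *lib = dlopen("libnvidia-ml.so.1", RTLD_NOW);
    if (!lib) {
        fprintf(stderr, "NVML library not found: %s\n", dlerror());
        return -1;
    }

    init_fn init; shutdown_fn shutdown; errstr_fn errstr;
    count_fn get_count; handle_fn get_handle; getrestr_fn get_restriction;
    /* POSIX idiom for storing a dlsym() result in a function pointer */
    *(void **)(&init)            = dlsym(lib, "nvmlInit_v2");
    *(void **)(&shutdown)        = dlsym(lib, "nvmlShutdown");
    *(void **)(&errstr)          = dlsym(lib, "nvmlErrorString");
    *(void **)(&get_count)       = dlsym(lib, "nvmlDeviceGetCount_v2");
    *(void **)(&get_handle)      = dlsym(lib, "nvmlDeviceGetHandleByIndex_v2");
    *(void **)(&get_restriction) = dlsym(lib, "nvmlDeviceGetAPIRestriction");
    if (!init || !shutdown || !errstr || !get_count || !get_handle || !get_restriction) {
        fprintf(stderr, "NVML library is missing an expected symbol\n");
        dlclose(lib);
        return -1;
    }

    nvmlReturn_t rc = init();
    if (rc != 0) {
        fprintf(stderr, "nvmlInit_v2: %s\n", errstr(rc));
        dlclose(lib);
        return -1;
    }

    unsigned int count = 0;
    rc = get_count(&count);
    if (rc != 0) {
        fprintf(stderr, "nvmlDeviceGetCount_v2: %s\n", errstr(rc));
        shutdown();
        dlclose(lib);
        return -1;
    }

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        unsigned int restricted = 0;
        rc = get_handle(i, &dev);
        if (rc == 0) {
            /* 0 = NVML_RESTRICTED_API_SET_APPLICATION_CLOCKS */
            rc = get_restriction(dev, 0, &restricted);
        }
        if (rc != 0) {
            printf("GPU %u: %s\n", i, errstr(rc));
        } else {
            printf("GPU %u: application clocks are %s\n", i, restriction_label(restricted));
        }
    }

    shutdown();
    dlclose(lib);
    return (int)count;
}
```

Calling report_app_clock_restrictions() from a small main() (linked with -ldl on older glibc versions) as a normal user should print one line per GPU. If the query itself also comes back as "Not Supported", that would suggest the problem is not specific to how I call nvmlDeviceSetAPIRestriction().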

Thank you for taking the time to read this lengthy post. I’d be grateful for any suggestions you might have!