Hi folks,
I am investigating how dynamically over/underclocking the GPU while some HPC code runs affects the total energy consumption of a given process. I have had promising results locally (RTX A4000), where I can run the experimental code with sudo, and I would now like to rerun the experiment on a remote node in an HPC cluster to test it on an A100. As this is a shared system, I cannot expect to have sudo access for running large amounts of code.
According to the NVML API documentation, most of the functions I need (1, 2) require “root/admin” permissions.
So, as I understand the documentation, the system administrator would have to run a script that calls nvmlDeviceSetAPIRestriction() to change the permission, and that change would then stay in place until a system reboot resets everything.
I have written a barebones script, but I think I must be misusing one of the functions: when run with sudo privileges, both locally and on the HPC node, it returns the same error, “NVML error at nvmlDeviceSetAPIRestriction: Not Supported”.
The local machine has 1x RTX A4000 and the remote node has 4x A100; they should all fully support the NVML feature set.
The script:
// Compile with: nvcc modifyNVMLpermission.cu -o modifyNVMLpermission -lnvidia-ml
// Run with: sudo ./modifyNVMLpermission
#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlReturn_t nvml;
    nvmlDevice_t nvmlDevice;
    unsigned int deviceNum = 0;

    // Initialise NVML; nothing else can work if this fails, so stop here.
    nvml = nvmlInit_v2();
    if (nvml != NVML_SUCCESS) {
        printf("\nNVML error at nvmlInit_v2: %s\n", nvmlErrorString(nvml));
        return 1;
    }
    printf("\nnvmlInit_v2 success\n");

    // Get a handle to the target GPU.
    nvml = nvmlDeviceGetHandleByIndex_v2(deviceNum, &nvmlDevice);
    if (nvml != NVML_SUCCESS) {
        printf("\nNVML error at nvmlDeviceGetHandleByIndex_v2: %s\n", nvmlErrorString(nvml));
        nvmlShutdown();
        return 1;
    }
    printf("\nnvmlDeviceGetHandleByIndex_v2 success\n");

    // Change the restriction on the application-clocks API for this device.
    printf("\nSetting APIRestriction on device %u\n", deviceNum);
    nvml = nvmlDeviceSetAPIRestriction(nvmlDevice, NVML_RESTRICTED_API_SET_APPLICATION_CLOCKS, NVML_FEATURE_ENABLED);
    if (nvml != NVML_SUCCESS) {
        printf("\nNVML error at nvmlDeviceSetAPIRestriction: %s\n", nvmlErrorString(nvml));
    } else {
        printf("\nAPI Restriction set successfully\n");
    }

    // Release NVML and report failure through the exit code.
    nvmlShutdown();
    return nvml == NVML_SUCCESS ? 0 : 1;
}
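To narrow down where the “Not Supported” comes from, it may help to read the current restriction with nvmlDeviceGetAPIRestriction() before trying to change it. Below is a minimal sketch of that idea, not a tested fix. It loads NVML with dlopen() so that it builds without nvml.h. The names restriction_label, report_restriction and the sk_ prefixed types are hypothetical, and the numeric constants are copied from nvml.h as I understand them. One thing worth checking alongside this: in nvml.h the third parameter of nvmlDeviceSetAPIRestriction() is named isRestricted, so as I read it NVML_FEATURE_ENABLED would keep the API root-only and NVML_FEATURE_DISABLED would open it up to other users.

```c
// Sketch: query (not set) the application-clocks restriction on one GPU.
// Build (assumed): gcc query_restriction.c -c   (link with -ldl on older glibc)
#include <dlfcn.h>
#include <stdio.h>

// Values copied from nvml.h (assumption: unchanged in your driver version).
enum { SK_NVML_SUCCESS = 0, SK_NVML_ERROR_NOT_SUPPORTED = 3, SK_NVML_ERROR_NO_PERMISSION = 4 };
enum { SK_API_SET_APPLICATION_CLOCKS = 0 };

typedef void *sk_device_t;  // stands in for the opaque nvmlDevice_t handle
typedef int (*sk_init_fn)(void);
typedef int (*sk_shutdown_fn)(void);
typedef int (*sk_get_handle_fn)(unsigned int, sk_device_t *);
typedef int (*sk_get_restriction_fn)(sk_device_t, int, int *);

// Hypothetical helper: turn a return code and restriction flag into a message.
const char *restriction_label(int ret, int isRestricted) {
    if (ret == SK_NVML_ERROR_NOT_SUPPORTED) return "not supported on this device";
    if (ret == SK_NVML_ERROR_NO_PERMISSION) return "caller lacks permission";
    if (ret != SK_NVML_SUCCESS) return "other NVML error";
    return isRestricted ? "restricted (root only)" : "unrestricted (any user)";
}

// Prints the restriction state for GPU `index`.
// Returns 0 if NVML could be loaded and initialised, -1 otherwise.
int report_restriction(unsigned int index) {
    void *lib = dlopen("libnvidia-ml.so.1", RTLD_NOW);
    if (!lib) {
        fprintf(stderr, "NVML library not found: %s\n", dlerror());
        return -1;
    }
    sk_init_fn nvml_init = (sk_init_fn)dlsym(lib, "nvmlInit_v2");
    sk_shutdown_fn nvml_shutdown = (sk_shutdown_fn)dlsym(lib, "nvmlShutdown");
    sk_get_handle_fn get_handle =
        (sk_get_handle_fn)dlsym(lib, "nvmlDeviceGetHandleByIndex_v2");
    sk_get_restriction_fn get_restriction =
        (sk_get_restriction_fn)dlsym(lib, "nvmlDeviceGetAPIRestriction");
    if (!nvml_init || !nvml_shutdown || !get_handle || !get_restriction ||
        nvml_init() != SK_NVML_SUCCESS) {
        dlclose(lib);
        return -1;
    }

    sk_device_t dev = NULL;
    int isRestricted = 0;
    int ret = get_handle(index, &dev);
    if (ret == SK_NVML_SUCCESS)
        ret = get_restriction(dev, SK_API_SET_APPLICATION_CLOCKS, &isRestricted);
    printf("device %u application clocks: %s\n", index,
           restriction_label(ret, isRestricted));

    nvml_shutdown();
    dlclose(lib);
    return 0;
}
```

Calling report_restriction(0) from a small main(), once as a normal user and once after the sudo run, should show whether the query itself is also reported as unsupported or whether only the set call fails.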
Thank you for taking the time to read this lengthy post. I’d be grateful for any suggestions you might have!