The vGPU Software Licensed Product goes into an Unlicensed state when provisioning a Kubernetes node on Azure Kubernetes Service (AKS). This seems to coincide with intermittent GPU failures on the node. What causes the license to end up in this state, and what is the best way to fix it?
Failure logs:
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
**RuntimeError: CUDA error: device doesn't have valid Grid license**
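To tell this failure apart from other CUDA errors in application code, the exception text can be matched against the message above. This is only a minimal sketch: `is_grid_license_error` is a hypothetical helper name, and it assumes the wording of the message stays as quoted in this log.

```python
def is_grid_license_error(exc: BaseException) -> bool:
    """Return True if the exception text matches the Grid license error.

    Assumption: the driver reports the problem with the wording seen in the
    log above ("device doesn't have valid Grid license").
    """
    return "valid grid license" in str(exc).lower()
```

A training script could catch `RuntimeError`, call this helper, and log a distinct message so that license-related failures are easy to separate from other CUDA errors.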
This is for GRID driver version:
Driver Version: 550.144.03
CUDA Version : 12.4
GPU used: NVIDIA A10-4Q (NVIDIA RTX Virtual Workstation), through Azure’s Standard_NV6ads_A10_v5 VM size (see "NVadsA10_v5 size series - Azure Virtual Machines | Microsoft Learn").
Temporary fix:
Restarting the nvidia-gridd process running on the node fixes it and sets the License status back to Licensed.
sudo pkill nvidia-gridd
sudo /usr/bin/nvidia-gridd &
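The manual workaround above could be automated roughly as follows. This is a sketch under assumptions, not a verified fix: it assumes `nvidia-smi -q` prints a `License Status` line under the "vGPU Software Licensed Product" section, and the function names (`parse_license_status`, `is_licensed`, `restart_gridd`, `check_and_fix`) are hypothetical.

```python
import re
import subprocess
from typing import List

# Assumption: `nvidia-smi -q` emits lines such as
#   "License Status : Licensed (...)" or "License Status : Unlicensed (...)".
LICENSE_LINE = re.compile(r"^[ \t]*License Status[ \t]*:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def parse_license_status(smi_output: str) -> List[str]:
    """Return every License Status value found in the nvidia-smi output."""
    return LICENSE_LINE.findall(smi_output)


def is_licensed(status: str) -> bool:
    """True if the status text starts with "Licensed" (so not "Unlicensed")."""
    return status.strip().lower().startswith("licensed")


def restart_gridd() -> None:
    """Apply the workaround from this post: kill nvidia-gridd and start it again."""
    subprocess.run(["sudo", "pkill", "nvidia-gridd"], check=False)
    subprocess.Popen(["sudo", "/usr/bin/nvidia-gridd"])


def check_and_fix() -> bool:
    """Restart nvidia-gridd if any reported status is not licensed.

    Returns True if a restart was triggered, False otherwise.
    """
    result = subprocess.run(
        ["nvidia-smi", "-q"], capture_output=True, text=True, check=True
    )
    statuses = parse_license_status(result.stdout)
    if any(not is_licensed(s) for s in statuses):
        restart_gridd()
        return True
    return False
```

Calling `check_and_fix()` periodically on the node (for example from a cron job or a privileged DaemonSet) would only paper over the problem; it does not explain why the license state changes in the first place.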
However, I did notice that the license being in an Unlicensed state does not always break workloads.
I ran a sample TensorFlow training job that uses CUDA and it worked. Screenshots below show the A10 node in an Unlicensed state and the training pod using GPUs. The pod spec for the TensorFlow job is here: "Use GPUs on Azure Kubernetes Service (AKS) - Azure Kubernetes Service | Microsoft Learn"
We seem to be facing the same issue.
Any updates?
I have the same issue.
When I run the vectorAdd sample, it returns:
[Vector addition of 50000 elements] Failed to allocate device vector A (error code CUDA-capable device(s) is/are busy or unavailable)!
When I run the pod spec for the TensorFlow job, I get:
2025-03-19 04:11:19.055372: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2025-03-19 04:11:19.137912: E tensorflow/core/common_runtime/direct_session.cc:170] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_VALUE
Traceback (most recent call last):
File "/app/main.py", line 212, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/app/main.py", line 185, in main
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
train()
File "/app/main.py", line 152, in train
sess = tf.InteractiveSession()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1612, in __init__
super(InteractiveSession, self).__init__(target, graph, config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 622, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.