Please provide the following information when requesting support.
• Hardware: DGX Station A100 with Kubernetes (K8s)
• Network Type: SSD
• TLT Version: 4.0.0 | kubectl version: v1.24.14
I created a K8s cluster with the Helm chart (one CPU-only master node, plus the DGX Station with the gpu-operator) and followed the SSD notebook instructions (with the TAO API) up to model creation and retraining.
Optionally, I've used a local ClearML server for visualization.
Everything runs well for a while, but eventually (maybe after two runs) jobs get stuck in the Pending state. I've tried creating new models, but the problem doesn't go away.
I can move a job from Pending to stopped, but then its status goes to Error (instead of Stopped). After that I can delete the job, but every new job still gets stuck in the Pending state.
Is there a way to check what is happening? (The `kubectl logs` command doesn't give enough info.)
Right now the only solution is to remove and redeploy the Helm charts. Is this a known problem? Is there a workaround?
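For reference, this is roughly how I cancel and then delete a stuck job before retrying. It is a minimal sketch following the same request pattern as the job-listing code below; the `/cancel` route and the DELETE call are my reading of the notebook cells, so treat the exact paths as assumptions.

import requests

# Cancel a stuck job, then delete it.
# base_url, model_ID, headers and rootca are set earlier in the notebook.
# NOTE: the cancel/delete routes below are assumptions based on the notebook cells.
job_ID = "ee58cad8-7cc6-47ae-8c61-c38ba08b93f4"  # one of the stuck jobs listed below

cancel_endpoint = f"{base_url}/model/{model_ID}/job/{job_ID}/cancel"
response = requests.post(cancel_endpoint, headers=headers, verify=rootca)
print(response.status_code, response.text)  # the job then shows up as Error, not Stopped

delete_endpoint = f"{base_url}/model/{model_ID}/job/{job_ID}"
response = requests.delete(delete_endpoint, headers=headers, verify=rootca)
print(response.status_code, response.text)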
This is how I list the current jobs:

# List the current jobs for the model and print a short status table.
# base_url, model_ID, headers and rootca are set earlier in the notebook.
import requests
from tabulate import tabulate

endpoint = f"{base_url}/model/{model_ID}/job"
response = requests.get(endpoint, headers=headers, verify=rootca)

for job in response.json():
    if "detailed_status" in job['result']:
        # Jobs that report a detailed status (e.g. finished trainings) get the full table.
        table_headers = ["job id", "action", "status", "result", "epoch",
                         "t_epoch", "message", "date"]
        data = [(job['id'],
                 job['action'],
                 job['status'],
                 job['result']['detailed_status']['status'],
                 f"{job['result']['epoch']}/{job['result']['max_epoch']}",
                 job['result']['time_per_epoch'],
                 job['result']['detailed_status']['message'],
                 f"[{job['result']['detailed_status']['date']}]"
                 f"[{job['result']['detailed_status']['time']}]")]
    else:
        # Jobs without a detailed status (e.g. Pending) only expose the basic fields.
        table_headers = ["job id", "action", "status"]
        data = [(job['id'], job['action'], job['status'])]
    print(tabulate(data, headers=table_headers, tablefmt="grid"))
output:
+--------------------------------------+----------+----------+----------+---------+----------------+---------------------------------+-----------------------+
| job id | action | status | result | epoch | t_epoch | message | date |
+======================================+==========+==========+==========+=========+================+=================================+=======================+
| 675518bf-86eb-47ee-ba2e-b1088f3ba08e | train | Done | SUCCESS | 6/6 | 0:02:05.933211 | Training finished successfully. | [6/12/2023][12:45:55] |
+--------------------------------------+----------+----------+----------+---------+----------------+---------------------------------+-----------------------+
+--------------------------------------+----------+----------+
| job id | action | status |
+======================================+==========+==========+
| 6aa7b64e-7b7d-4eaf-b036-9b42365846d7 | train | Pending |
+--------------------------------------+----------+----------+
+--------------------------------------+----------+----------+----------+---------+----------------+---------------------------------+----------------------+
| job id | action | status | result | epoch | t_epoch | message | date |
+======================================+==========+==========+==========+=========+================+=================================+======================+
| 2b827b58-cc53-4141-a6fb-50dc4787b784 | train | Done | SUCCESS | 6/6 | 0:02:05.295144 | Training finished successfully. | [5/4/2023][17:47:29] |
+--------------------------------------+----------+----------+----------+---------+----------------+---------------------------------+----------------------+
+--------------------------------------+----------+----------+----------+---------+----------------+---------------------------------+-----------------------+
| job id | action | status | result | epoch | t_epoch | message | date |
+======================================+==========+==========+==========+=========+================+=================================+=======================+
| 310a19dc-1925-4cf0-bd4f-ad9dae50efb3 | train | Done | SUCCESS | 20/20 | 0:02:13.025302 | Training finished successfully. | [6/13/2023][13:51:41] |
+--------------------------------------+----------+----------+----------+---------+----------------+---------------------------------+-----------------------+
+--------------------------------------+----------+----------+
| job id | action | status |
+======================================+==========+==========+
| ee58cad8-7cc6-47ae-8c61-c38ba08b93f4 | train | Pending |
+--------------------------------------+----------+----------+
After starting over with a new model, everything is now stuck in the Pending state:
+--------------------------------------+----------+----------+
| job id | action | status |
+======================================+==========+==========+
| 109422cc-c8df-4231-be36-17be63116e74 | train | Pending |
+--------------------------------------+----------+----------+
+--------------------------------------+----------+----------+
| job id | action | status |
+======================================+==========+==========+
| 5a2c225a-1923-4ecb-895a-999e6e1e2d29 | train | Pending |
+--------------------------------------+----------+----------+
+--------------------------------------+----------+----------+
| job id | action | status |
+======================================+==========+==========+
| 0c4f97c9-93f0-422e-9ebb-837f5119f6d1 | train | Pending |
+--------------------------------------+----------+----------+
+--------------------------------------+----------+----------+
| job id | action | status |
+======================================+==========+==========+
| 971fe00a-5110-4d91-b2a6-5fb4e550b179 | train | Error |
+--------------------------------------+----------+----------+
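The tables above only show a few fields. In case it helps, the full JSON payload of the pending jobs can be dumped by reusing the same list endpoint (a quick sketch):

import json
import requests

# Print the complete JSON payload for every job that is still Pending,
# reusing the same list endpoint as above (base_url, model_ID, headers
# and rootca are set earlier in the notebook).
endpoint = f"{base_url}/model/{model_ID}/job"
response = requests.get(endpoint, headers=headers, verify=rootca)
for job in response.json():
    if job['status'] == "Pending":
        print(json.dumps(job, sort_keys=True, indent=4))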
I double-checked and made sure that nothing GPU-intensive (such as another training run) is running in the background:
Tue Jun 13 19:22:30 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:01:00.0 Off | 0 |
| N/A 31C P0 51W / 275W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:47:00.0 Off | 0 |
| N/A 32C P0 52W / 275W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:81:00.0 Off | 0 |
| N/A 31C P0 53W / 275W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA DGX Display On | 00000000:C1:00.0 Off | N/A |
| 34% 39C P8 N/A / 50W | 6MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:C2:00.0 Off | 0 |
| N/A 31C P0 51W / 275W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 6099 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 6099 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 6099 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 6099 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 6099 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
No, nothing else is using the GPUs.
Cheers