Please provide the following information when requesting support.
• Hardware: NVIDIA DGX A100
• Network Type: SSD
• TAO version: 4.0.1 (4.0.2 Helm chart)
• Driver: NVIDIA-SMI 525.105.17, Driver Version 525.105.17, CUDA Version 12.0
Hi, I had been running an AutoML job for a few hours when one experiment hit an error and training was aborted (no GPU activity). However, according to the jobs_metadata/job-id.json file (on the k8s PV at {k8-pv-root}/users/user-name/models/model-name/jobs_metadata/job-id.json):
{
  "id": "8d04c165-9c00-4a42-87b6-b99978e314e4",
  "parent_id": null,
  "action": "train",
  "created_on": "2023-06-28T17:24:32.019439",
  "last_modified": "2023-06-28T17:24:32.028681",
  "status": "Running",
  "result": {}
}
the workflow still seems to think the job is running, even though the failed experiment log, log.txt (93.0 KB, attached), says it has failed:
Current pipeline object is no longer valid.
[[{{node Dali}}]]
0 successful operations.
0 derived errors ignored.
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[43892,1],3]
Exit code: 1
--------------------------------------------------------------------------
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
Note: “Per user-direction” doesn’t make sense here, because I did not abort anything (this happened during the night while I was away). Is this message coming from the main process that supervises the training flow?
However, I was able to “stop the job” (with the aim of resuming it) via endpoint = f"{base_url}/model/{model_ID}/job/{job_id}/cancel", and then resume it via endpoint = f"{base_url}/model/{model_ID}/job/{job_id}/resume", and I can see that a new experiment has been started.
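For reference, this is roughly how I issue those calls (a minimal sketch; the base_url format, the use of requests.post, and the placeholder IDs are specific to my setup and may differ in other deployments):

import requests

# Placeholders from my setup -- adjust host, user ID and any auth headers to your deployment.
base_url = "http://<tao-api-host>:<port>/api/v1/user/<user-id>"  # assumed base URL format
model_ID = "<model-uuid>"
job_id = "8d04c165-9c00-4a42-87b6-b99978e314e4"  # the stuck train job from jobs_metadata above

# Stop ("cancel") the stuck job...
endpoint = f"{base_url}/model/{model_ID}/job/{job_id}/cancel"
response = requests.post(endpoint)
print(response.status_code, response.text)

# ...then resume it, after which a new experiment is started.
endpoint = f"{base_url}/model/{model_ID}/job/{job_id}/resume"
response = requests.post(endpoint)
print(response.status_code, response.text)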
Before that (i.e. before I stopped and resumed the job via the API calls) it looked like this [screenshot attached], and my ClearML dashboard also confirms it (GPU activity ramping up).
Note: the “_experiment_5” in the ClearML screenshot is a number I assigned myself; it does not refer to “experiment_0” through “experiment_4” shown in the TAO Toolkit PV screenshots in the other pictures.
My question is: is this going to affect the AutoML job, given that there is an “experiment_3” folder in the job directory that did not complete successfully? I ask because I have previously seen consecutive jobs interfere with each other (e.g. instead of being queued, the second job starts, the first job halts, and its job metadata gets stuck in the “Running” state), and in that case too, even after the ephemeral pods were removed following the stop command, the metadata still said the job was running and the API calls also returned the status as Running.
Second question: if what I did (stopping and resuming via the API calls) is fine, is that the recommended way to handle a problem like this?
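For completeness, a minimal sketch of how the stale status can be confirmed directly from the metadata file on the PV (the mount path and names below are placeholders matching the layout described above):

import json
from pathlib import Path

# Placeholders matching the PV layout above -- adjust the mount point and names to your cluster.
meta_path = Path(
    "/<k8-pv-root>/users/<user-name>/models/<model-name>/"
    "jobs_metadata/8d04c165-9c00-4a42-87b6-b99978e314e4.json"
)

meta = json.loads(meta_path.read_text())
# Still reports "Running" even though the experiment log above says the training failed.
print(meta["status"], meta["last_modified"])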