TLT training output and logs

I need to understand how to re-attach to a training process in TLT 3.0. Today I started a training process remotely and lost the connection to the machine. When I reconnected, I could see that the process was still running because 1) there was a process listed in the output of “tlt list” and 2) nvidia-smi showed that the GPUs were still in use (full memory and 100% GPU utilization). However, TensorBoard was no longer updating, and I cannot find the logs or any documentation on how to reconnect to the training container. I tried the docker logs command, but it only shows the beginning of the process, not the training results.

You can try the following to re-attach to the Docker container.
For example:

$ tlt list
==============  ==================  =========================
container_id    container_status    command
==============  ==================  =========================
cf3ec05452      running             Not in support DNN tasks.
1bd7355433      running             Not in support DNN tasks.
a71441643b      running             Not in support DNN tasks.
2689a9b87a      running             Not in support DNN tasks.
86aa84dcd4      running             Not in support DNN tasks.
27d035ef2e      running             Not in support DNN tasks.
70ac067036      running             Not in support DNN tasks.
==============  ==================  =========================

$ docker exec -it 1bd7355433 /bin/bash

That attached the terminal to the container such that one can execute commands within the container but it does not show the output being generated by the training process. The same problem occurs with the docker attach command.

In your case (the connection was lost accidentally), I am afraid the log was not saved.
The next time you start training, you can save the log by piping the training output to a file, for example with tee.

That will not work either because the lost connection will kill the tee command. Combined with screen or tmux this could work.
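
A minimal sketch of the tee-plus-tmux idea, assuming tmux is installed on the host and the tlt launcher is run from the host shell. start_logged_session is a hypothetical helper, not part of TLT, and the spec path, results directory, and key variable in the example are placeholders.

```shell
# Sketch: run a long command inside a detached tmux session and tee its
# output to a log file, so a dropped SSH connection kills neither of them.
# start_logged_session is a hypothetical helper, not part of TLT.
start_logged_session() {
    local session="$1"   # tmux session name
    local log="$2"       # log file on the host
    local cmd="$3"       # full command line, passed as a single string
    # tmux runs the string through a shell, so the pipe to tee lives
    # inside the session and keeps running after the client disconnects.
    tmux new-session -d -s "$session" "$cmd 2>&1 | tee -a '$log'"
}

# Example (all paths and $KEY are placeholders):
# start_logged_session tlt_train "$HOME/tlt_train.log" \
#     "tlt yolo_v3 train -e /workspace/specs/train.txt -r /workspace/results -k \$KEY"
#
# After reconnecting:
#   tmux attach -t tlt_train        # live view of the training output
#   tail -f "$HOME/tlt_train.log"   # or follow the saved log
```

Because the pipe runs inside the tmux session rather than in the SSH login shell, losing the connection should only detach the client. screen could be used in the same way.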

Is the output of the training process logged to a file within the container somewhere?

Which network did you run? Some networks can save the logs, but most do not support it.

I’m experimenting with the various object detection networks. Which networks save the logs?

The object detection networks do support it. (Please refer to TLT training output and logs - #15 by Morganh.)

The ASR or NLP models support saving the logs.

How does one submit a feature request?

I will sync with the internal team.


One more question: you mentioned that the training is still ongoing. Can you check whether new tlt models are still being saved in the results directory?

No new TLT models were saved before I killed the process. However this was the only process running that was using the GPUs and nvidia-smi reported that their memory was full and their utilization was 100%.
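
One way to run this kind of check from the host is to list files under the results directory that were modified recently. This is only a sketch: recent_outputs is a hypothetical helper, the directory in the example is a placeholder, and it assumes the results directory is mounted on the host and that find supports -mmin (GNU and BSD find do).

```shell
# Sketch: list files under a results directory that were modified within
# the last N minutes (default 30). An empty result suggests that nothing
# new, such as a checkpoint, has been written recently.
recent_outputs() {
    local dir="$1"
    local minutes="${2:-30}"
    find "$dir" -type f -mmin "-$minutes" | sort
}

# Example (placeholder path):
# recent_outputs /workspace/results 60
```

If the list stays empty for longer than one checkpoint interval while the GPUs are fully loaded, that is consistent with the situation described above, where the process was busy but no new models appeared.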

Which network did you train?

TLT 3.0 has this feature: you can save the log during training.
See, for example, the YOLO_v3 page of the Transfer Learning Toolkit 3.0 documentation.

--log_file : The path to the log file. The default path is “stdout”.
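
A hedged sketch of how the --log_file option might be used. Only --log_file comes from the post above; the -e, -r, and -k arguments follow the usual TLT train invocation as far as I recall it, and every path and the $KEY variable are placeholders. The sketch also assumes the log path is resolved inside the container, so it should sit under a directory that is mounted from the host.

```shell
# Placeholder paths; adjust them to your own workspace mapping.
SPEC=/workspace/specs/yolo_v3_train.txt
RESULTS=/workspace/results
LOG="$RESULTS/train.log"

# Compose the command once so it can be inspected before it is run.
# \$KEY is kept literal here and expanded only when the command runs.
train_cmd="tlt yolo_v3 train -e $SPEC -r $RESULTS -k \$KEY --log_file $LOG"
echo "$train_cmd"

# To launch for real:
#   eval "$train_cmd"
# To follow progress after reconnecting (host-side path of the mounted file):
#   tail -f "$LOG"
```

Writing the log into the mounted results directory means it remains readable from the host even when the terminal that started the training is gone.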

The DetectNet_v2 documentation does not mention it, but it can actually save the log as well. We will update the user guide.
