Please provide the following information when requesting support.
• Hardware (GeForce 3080Ti)
• Network Type (Detectnet_v2)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(detectnet_v2_train_resnet18_kitti.txt)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
When I use detectnet_v2.ipynb to train on my data, I get an error. Please help me.
detectnet_v2_train_resnet18_kitti.txt (5.4 KB)
log.txt (52.3 KB)
Please double check.
And also monitor the “$nvidia-smi” to check the GPU memory consumption.
When the docker stops, the nvidia-smi output is:
Please double check?
Could you tell me which file to check?
I mean you can run again to double confirm.
Before running again, please run below in terminal.
$ nvidia-smi --query-gpu=memory.used,memory.total --format=csv -i 0 -l 1
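If it helps, the output of the command above can be redirected to a file and summarized after training stops. The following is only a minimal sketch: the helper name `peak_gpu_mem` and the file name `gpu_mem.csv` are made up for illustration and are not part of TAO or nvidia-smi. It assumes the usual CSV layout with rows like `1234 MiB, 12288 MiB`.

```shell
# Hypothetical helper (not part of TAO or nvidia-smi): report the peak
# GPU memory seen in a log captured with, for example:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv -i 0 -l 1 > gpu_mem.csv
# Assumption: data rows look like "1234 MiB, 12288 MiB"; the header row and
# any other line that does not start with a digit are skipped.
peak_gpu_mem() {
  awk -F',' '
    /^[0-9]/ {
      used = $1 + 0            # numeric prefix of "1234 MiB"
      total = $2 + 0           # numeric prefix of " 12288 MiB"
      if (used > peak) peak = used
    }
    END { print peak + 0 " / " total + 0 " MiB" }
  ' "$1"
}

# Usage, after stopping the nvidia-smi loop with Ctrl+C:
#   peak_gpu_mem gpu_mem.csv
```

If the peak value is close to the total, the training may be running out of GPU memory; if it stays low, the error probably has another cause.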
I tested it many times but still get this error. The result of ‘nvidia-smi --query-gpu=memory.used,memory.total --format=csv -i 0 -l 1’ is:
When I run tao training, it shows this error.
You can ignore it. It is not an error.
Can you share more info?
What is $NUM_GPUS?
Can you share the log when you try to run again with a new folder, " -r $USER_EXPERIMENT_DIR/new_folder "?
$NUM_GPUS=1
log.txt (52.7 KB)
Please open a terminal to debug as below.
$ tao detectnet_v2 run /bin/bash
then
# detectnet_v2 train xxx
The log is:
log.txt (45.9 KB)
Please use a new folder and share the log. Thanks a lot.
-r tao-experiments/detectnet_v2/experiment_dir_unpruned_new
OK, here is the log. Thank you very much.
log.txt (49.6 KB)
But I have a question: a few days ago this TAO training succeeded, but when I tried again today, it gave this error.
OK, thanks for the info. May I ask whether you are aware of the change and workaround below?
Yes, there is an update from ngccli that causes this issue when the TAO docker is triggered. We’re unable to install the CLI via the entrypoint since the relative path of the binary in the zip file has changed. Internally, the team will update the launcher with a fix. For now, please use the workarounds below.
Please refer to the first workaround below.
Just add this: --entrypoint ""
For example,
$ docker run --runtime=nvidia -it --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-p…
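To avoid retyping the flag each time, the workaround can be wrapped in a small shell function. This is only a sketch: the function name `run_tao_no_entrypoint` is hypothetical and not part of the TAO launcher, and the image tag is deliberately left as an argument because the command above is truncated. Pass the full tag you normally use, followed by the command to run inside the container (for example /bin/bash, which is an assumption here).

```shell
# Hypothetical wrapper (not part of TAO): start the container with the
# entrypoint cleared, as in the workaround above.
#   $1      full image tag (not filled in here; use your own)
#   $2...   command to run inside the container, e.g. /bin/bash (assumption)
run_tao_no_entrypoint() {
  if [ "$#" -lt 1 ]; then
    echo "usage: run_tao_no_entrypoint IMAGE [COMMAND...]" >&2
    return 1
  fi
  image=$1
  shift
  # --entrypoint "" skips the entrypoint script that fails to install the NGC CLI.
  docker run --runtime=nvidia -it --rm --entrypoint "" "$image" "$@"
}

# Usage (replace IMAGE with your full image tag):
#   run_tao_no_entrypoint IMAGE /bin/bash
```

Any volume mounts or other options you normally pass to docker would still need to be added; they are left out here because the original command is truncated.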
This solved my error, thank you very much.
But if I run into a similar problem again, where can I find the answer?
You can create a topic in this forum. Thanks.