Tao toolkit container not installing

According to “tao info --verbose”, the detectnet_v2 network runs in the container nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3

You can retry the original command:
$ tao detectnet_v2 dataset_convert -h

I tried everything, with no luck. Nothing works.

To begin with, the tao list command itself does not show the docker list; it shows a timeout error instead. Kindly refer to the logs shared above.

Can you run the command below and share the result?
$ tao detectnet_v2 -h

:~/detectnet_v2$ tao detectnet_v2 --help
2022-06-02 18:19:45,560 [INFO] root: Registry: ['nvcr.io']
2022-06-02 18:19:45,651 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
2022-06-02 18:19:45,990 [INFO] tlt.components.docker_handler.docker_handler: The required docker doesn't exist locally/the manifest has changed. Pulling a new docker.
2022-06-02 18:19:45,990 [INFO] tlt.components.docker_handler.docker_handler: Pulling the required container. This may take several minutes if you're doing this for the first time. Please wait here.
...
Pulling from repository: nvcr.io/nvidia/tao/tao-toolkit-tf

It has been running since 18:19:45, more than 30 minutes now, and it has neither completed nor failed. It just keeps running.

  1. Am I missing any dependency?
  2. I can log in to the nvcr.io registry, so why is the container not being pulled?

output for: docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3

docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
v3.21.11-tf1.15.4-py3: Pulling from nvidia/tao/tao-toolkit-tf
e4ca327ec0e7: Already exists
b99d76492afe: Already exists
b52f6fb756a5: Already exists
5a09baa528d6: Already exists
2df930949a05: Already exists
934eb401e46c: Downloading [===========================================>       ]  882.6MB/1.021GB
3244eb9db036: Download complete
e2e27029eb8e: Downloading [================================>                  ]  794.6MB/1.22GB
bb65579dd223: Download complete
f286d9ed18b6: Download complete
19be41539f88: Download complete
cefdfdffffa4: Downloading [======================>                            ]  858.6MB/1.9GB
c492d17ef893: Waiting
c6a37e1a8568: Waiting
244a64bffce5: Waiting
936011990b9b: Waiting
a84d68ccf4da: Waiting
3b9afd93de94: Waiting
fecad3989e11: Waiting
80727a1dd7d9: Waiting
ad5b50ee8663: Waiting
f85d966d579f: Waiting
80b7ce251537: Waiting
e6701f6773d4: Waiting
75e85ce3fde9: Waiting
d5204e30c651: Waiting
a5e96cc7d486: Waiting
2b4a743d384e: Waiting
2966a9405a32: Waiting
a6ee9d853f8b: Waiting
8555f3a80202: Waiting
fa03b5157f19: Waiting
53833e10ff45: Waiting
6eafdc015b75: Waiting
af9f43f64fe3: Waiting
27d6fa02bcfa: Waiting
fdaaa26bc895: Waiting
f76c248c9240: Waiting
a80d890b935f: Waiting
2e6bf771f34a: Waiting
4432665a6ac2: Waiting
91b860113bdd: Waiting
0c19e315f626: Waiting
ef3b7019500d: Waiting
08148f98f859: Waiting
3d7433a08679: Waiting
f7e8c22bc1cb: Waiting
606df40bb670: Waiting
a49d4b169dc5: Waiting
3c434869df88: Waiting
505c9a42229c: Waiting
ba767af43f7a: Waiting
f25dca5730a3: Waiting
a30ffc53d907: Waiting
61f789d0a78e: Waiting
91dc4b49789d: Waiting
c44ac0fe4f98: Waiting
92c240489b34: Waiting


Please stop it. It is not normal.

Can you pull it successfully?

No, it is not pulling the container successfully.
The command keeps running without completing or failing.

Latest output: docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3

docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
v3.21.11-tf1.15.4-py3: Pulling from nvidia/tao/tao-toolkit-tf
e4ca327ec0e7: Already exists
b99d76492afe: Already exists
b52f6fb756a5: Already exists
5a09baa528d6: Already exists
2df930949a05: Already exists
934eb401e46c: Downloading [===========================================>       ]  882.6MB/1.021GB
3244eb9db036: Download complete
e2e27029eb8e: Downloading [================================>                  ]  794.6MB/1.22GB
bb65579dd223: Download complete
f286d9ed18b6: Download complete
19be41539f88: Download complete
cefdfdffffa4: Downloading [======================>                            ]  858.6MB/1.9GB
c492d17ef893: Waiting
c6a37e1a8568: Waiting
244a64bffce5: Waiting
936011990b9b: Waiting
a84d68ccf4da: Waiting
3b9afd93de94: Waiting
fecad3989e11: Waiting
80727a1dd7d9: Waiting
cefdfdffffa4: Downloading [=>                                                 ]  22.54MB/1.037GB
f85d966d579f: Waiting
80b7ce251537: Waiting
e6701f6773d4: Waiting
75e85ce3fde9: Waiting
d5204e30c651: Waiting
a5e96cc7d486: Waiting
2b4a743d384e: Waiting
2966a9405a32: Waiting
a6ee9d853f8b: Waiting
8555f3a80202: Waiting
fa03b5157f19: Waiting
53833e10ff45: Waiting
6eafdc015b75: Waiting
af9f43f64fe3: Waiting
27d6fa02bcfa: Waiting
fdaaa26bc895: Waiting
f76c248c9240: Waiting
a80d890b935f: Waiting
2e6bf771f34a: Waiting
4432665a6ac2: Waiting
91b860113bdd: Waiting
0c19e315f626: Waiting
ef3b7019500d: Waiting
08148f98f859: Waiting
3d7433a08679: Waiting
f7e8c22bc1cb: Waiting
606df40bb670: Waiting
a49d4b169dc5: Waiting
3c434869df88: Waiting
505c9a42229c: Waiting
ba767af43f7a: Waiting
f25dca5730a3: Waiting
a30ffc53d907: Waiting
61f789d0a78e: Waiting
91dc4b49789d: Waiting
c44ac0fe4f98: Waiting
92c240489b34: Waiting

It is not expected. Please check your disk, network, etc. If possible, you can try to use another machine.
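One way to confirm that a pull is really stuck, rather than just slow, is to compare two snapshots of the "docker pull" output taken some minutes apart. The following is a minimal Python sketch under that assumption; the function names (downloaded_bytes, stalled_layers) are made up for illustration and are not part of docker or TAO.

```python
import re

# Matches progress lines such as:
#   934eb401e46c: Downloading [====>     ]  882.6MB/1.021GB
PROGRESS = re.compile(
    r"^(?P<layer>[0-9a-f]{12}): Downloading \[[=> ]*\]\s+"
    r"(?P<done>[\d.]+)(?P<done_unit>[kMG]?B)/"
    r"(?P<total>[\d.]+)(?P<total_unit>[kMG]?B)\s*$"
)

UNIT_BYTES = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9}


def downloaded_bytes(snapshot: str) -> dict:
    """Map layer id -> bytes downloaded so far, for layers still downloading."""
    progress = {}
    for line in snapshot.splitlines():
        match = PROGRESS.match(line.strip())
        if match:
            amount = float(match["done"]) * UNIT_BYTES[match["done_unit"]]
            # If a layer id appears more than once, keep the largest value seen.
            layer = match["layer"]
            progress[layer] = max(progress.get(layer, 0.0), amount)
    return progress


def stalled_layers(earlier: str, later: str) -> list:
    """Return layers that were downloading in both snapshots without progress."""
    before = downloaded_bytes(earlier)
    after = downloaded_bytes(later)
    return sorted(
        layer
        for layer in before
        if layer in after and after[layer] <= before[layer]
    )
```

In the two outputs posted in this thread, layers 934eb401e46c and e2e27029eb8e show identical byte counts, so a check like this would flag them as stalled.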

Or maybe the network speed is a little slow.

If the network were the problem, the download should not have started at all, but here the layers start downloading and then get stuck partway. I am using an x86_64 GPU machine with 32 GB of RAM.

Is there any other way to download the container manually, instead of using the docker pull command?

Could you try running “docker pull” on another machine?
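On the question of a manual download: a commonly used alternative, not suggested by anyone in this thread, is to pull the image on a machine where the pull works, export it with "docker save", copy the archive over, and import it with "docker load". The sketch below only builds those two command lines; the archive name tao-toolkit-tf.tar and the helper transfer_commands are hypothetical.

```python
import shlex

IMAGE = "nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3"


def transfer_commands(image: str, archive: str) -> dict:
    """Build the docker commands for moving an image between machines as a tar file."""
    return {
        # Run on a machine where `docker pull` of the image has succeeded.
        "save": ["docker", "save", "-o", archive, image],
        # Run on the target machine after copying the archive over.
        "load": ["docker", "load", "-i", archive],
    }


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for step, argv in transfer_commands(IMAGE, "tao-toolkit-tf.tar").items():
        print(f"{step}: {shlex.join(argv)}")
```

The archive is roughly as large as the image itself, so this helps only when the target machine has enough disk space and the stall is on the network side.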

Hi @Morganh
Yes, I tried on another machine and it worked. I am able to run the tao command now.

I tried the detectnet_v2 dataset_convert sample. It reported success, but I could not find the converted .tfrecord file.

dataset used: PASCAL VOC 2012 dataset → converted to KITTI format → dataset_convert_config file
data_convert_config_spec.txt (409 Bytes)

command: tao detectnet_v2 dataset_convert -v -d /home/soundarrajan/detectnet_v2/config/data_convert_config_spec.txt -o /home/soundarrajan/detectnet_v2/result --log_file /home/soundarrajan/detectnet_v2/result/dataset_convert_log.txt

Output:
dataset_convert_log.txt (4.3 KB)

.tao_mounts.json file used:
.tao_mounts.json (491 Bytes)

Kindly check whether anything is missing in the config and help me fix it.

For debugging, please open a shell inside the docker container and run the command again there.
Steps:
$ tao detectnet_v2 run /bin/bash

then

# dataset_convert -v -d /home/soundarrajan/detectnet_v2/config/data_convert_config_spec.txt -o /home/soundarrajan/detectnet_v2/result --log_file /home/soundarrajan/detectnet_v2/result/dataset_convert_log.txt

Hi @Morganh ,

I have tried both ways. The command executes successfully, as you can see in the attached logs, but I cannot find the converted .tfrecord file in the given output folder (-o /home/soundarrajan/detectnet_v2/result).

In which path will the converted .tfrecord be saved?
Kindly refer to the config file and the .tao_mounts.json file I attached previously.
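When output seems to vanish, one thing worth ruling out is whether the output folder is covered by a mount at all. The sketch below assumes the usual .tao_mounts.json layout, a "Mounts" list of "source"/"destination" pairs, and translates a host path to its in-container path. The helper container_path and the example config are hypothetical; the attached file's real contents are not shown in this thread.

```python
from pathlib import PurePosixPath


def container_path(host_path: str, mounts_config: dict):
    """Translate a host path to its in-container path, or None if it is not mounted."""
    target = PurePosixPath(host_path)
    for mount in mounts_config.get("Mounts", []):
        source = PurePosixPath(mount["source"])
        # The path is covered if it is the mount source or lies below it.
        if target == source or source in target.parents:
            relative = target.relative_to(source)
            return str(PurePosixPath(mount["destination"]) / relative)
    return None


# Hypothetical config for illustration only.
example = {
    "Mounts": [
        {
            "source": "/home/soundarrajan/detectnet_v2",
            "destination": "/workspace/detectnet_v2",
        }
    ]
}
print(container_path("/home/soundarrajan/detectnet_v2/result", example))
```

If the function returns None for the -o folder, files written to that path inside the container would stay in the container's own filesystem rather than appear on the host. This is only one possible cause.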

I am stuck at training the model; I need the tfrecords to proceed with training.

So kindly help check this and provide a solution as soon as possible.

Please share all the logs.

I already shared all the logs for dataset_convert in post #19 of this topic.

Please mention which logs are now required for debugging.

Please follow the “TAO Toolkit Launcher” page of the TAO Toolkit 3.22.05 documentation and add the following to your ~/.tao_mounts.json.

    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "ports": {
            "8888": 8888
        }
    }
}
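For readers who prefer not to edit the file by hand, here is a minimal Python sketch that merges these options into an existing mounts file. It assumes the "user" value should be the host user's UID:GID (1000:1000 in the snippet above) and that the file is at ~/.tao_mounts.json; the function names are made up for illustration.

```python
import json
import os
from pathlib import Path


def with_docker_options(config: dict, uid: int, gid: int) -> dict:
    """Return a copy of the launcher config with the DockerOptions from this thread."""
    updated = dict(config)
    updated["DockerOptions"] = {
        "shm_size": "16G",
        "ulimits": {"memlock": -1, "stack": 67108864},
        # Assumption: run as the host user so that files written to mounted
        # folders are owned by that user on the host.
        "user": f"{uid}:{gid}",
        "ports": {"8888": 8888},
    }
    return updated


def update_mounts_file(path: Path) -> None:
    """Rewrite the mounts file in place, keeping existing keys such as "Mounts"."""
    config = json.loads(path.read_text()) if path.exists() else {}
    merged = with_docker_options(config, os.getuid(), os.getgid())
    path.write_text(json.dumps(merged, indent=4) + "\n")
```

Calling update_mounts_file(Path.home() / ".tao_mounts.json") would apply the change; it is worth backing up the file first, since it is overwritten.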

Hi @Morganh ,

I added the docker options, and the tfrecords are now generated successfully. But now I am getting a core dump error in the training part.
I have attached the full logs and config file for reference.

COMMAND: tao detectnet_v2 train -k tao_encode -n detectnet_v2_resnet18 -r /home/soundarrajan/detectnet_v2/result/training -e /home/soundarrajan/detectnet_v2/config/detectnet_v2_train_config.txt --log_file /home/soundarrajan/detectnet_v2/logs/training_log.txt

ERROR:

INFO:tensorflow:Graph was finalized.
2022-06-06 12:34:52,779 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2022-06-06 12:34:54,226 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2022-06-06 12:34:54,743 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2022-06-06 12:35:00,359 [INFO] tensorflow: Saving checkpoints for step-0.
2022-06-06 12:35:15.946288: F tensorflow/core/kernels/cuda_solvers.cc:94] Check failed: cusolverDnCreate(&cusolver_dn_handle) == CUSOLVER_STATUS_SUCCESS Failed to create cuSolverDN instance.
[2e033a5e779a:00072] *** Process received signal ***
[2e033a5e779a:00072] Signal: Aborted (6)
[2e033a5e779a:00072] Signal code:  (-6)
[2e033a5e779a:00072] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7f2a55138f10]
[2e033a5e779a:00072] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f2a55138e87]
[2e033a5e779a:00072] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f2a5513a7f1]
[2e033a5e779a:00072] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x82f75b4)[0x7f29cebea5b4]
[2e033a5e779a:00072] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow10CudaSolverC1EPNS_15OpKernelContextE+0x102)[0x7f29cab3d042]
[2e033a5e779a:00072] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18MatrixInverseOpGpuIfE12ComputeAsyncEPNS_15OpKernelContextESt8functionIFvvEE+0x147)[0x7f29ca1f9d27]
[2e033a5e779a:00072] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice12ComputeAsyncEPNS_13AsyncOpKernelEPNS_15OpKernelContextESt8functionIFvvEE+0xeb)[0x7f29c5b0f69b]
[2e033a5e779a:00072] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf9617d)[0x7f29c5b7317d]
[2e033a5e779a:00072] [ 8] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf97c6f)[0x7f29c5b74c6f]
[2e033a5e779a:00072] [ 9] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f29c5c24791]
[2e033a5e779a:00072] [10] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f29c5c21df8]
[2e033a5e779a:00072] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7f2a530236df]
[2e033a5e779a:00072] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f2a54ee26db]
[2e033a5e779a:00072] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f2a5521b61f]
[2e033a5e779a:00072] *** End of error message ***
Aborted (core dumped)

training config file:
detectnet_v2_train_config.txt (11.1 KB)

training detectnet_v2 full log file:
training_log.txt (42.0 KB)

Kindly check and help me train the model successfully.

Can we create a new topic, since the original issue is resolved?

Hi @Morganh,

Sure, we can close this ticket.

I have created a new topic for the training core dump error.
