Tao toolkit container not installing

I’m stuck at the training step. I need the tfrecords to be generated before I can proceed with training.

Kindly check and provide a solution as soon as possible.

Please share all the logs.

I already shared all the logs for dataset_convert in post #19 of this thread (Tao toolkit container not installing - #19 by soundarrajan).

Please let me know which logs are required now for debugging.

Please follow the TAO Toolkit Launcher page of the TAO Toolkit 3.22.05 documentation and add the following to your tao_mounts.json.

    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "ports": {
            "8888": 8888
        }
    }
}
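
If you would rather not edit the file by hand, the snippet below is a minimal sketch of how the same options could be merged into an existing mounts file. The helper names `add_docker_options` and `update_mounts_file` are hypothetical, and the file location used in the usage note is an assumption; point it at wherever your tao_mounts.json actually lives. The values are copied from the fragment above and are not tuned for any particular machine.

```python
import json
from pathlib import Path

# Values copied from the reply above. "user" is presumably UID:GID,
# so adjust it to match your own account if it differs.
DOCKER_OPTIONS = {
    "shm_size": "16G",
    "ulimits": {"memlock": -1, "stack": 67108864},
    "user": "1000:1000",
    "ports": {"8888": 8888},
}


def add_docker_options(config: dict) -> dict:
    """Return a copy of config with DOCKER_OPTIONS merged into "DockerOptions".

    Keys already present under "DockerOptions" are kept unless they are
    overridden by DOCKER_OPTIONS. The input dict is not modified.
    """
    merged = dict(config)
    merged["DockerOptions"] = {**config.get("DockerOptions", {}), **DOCKER_OPTIONS}
    return merged


def update_mounts_file(path: Path) -> None:
    """Read the JSON mounts file at path, merge the options, and write it back."""
    config = json.loads(path.read_text())
    path.write_text(json.dumps(add_docker_options(config), indent=4) + "\n")
```

For example, `update_mounts_file(Path.home() / ".tao_mounts.json")` would update the file in your home directory, assuming that is where it is stored. Back up the file first, since it is rewritten in place.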

Hi @Morganh ,

I added the Docker options, and the tfrecords are now generated successfully. However, I am now getting a core dump error during training.
I have attached the full logs and the config file for reference.

COMMAND: tao detectnet_v2 train -k tao_encode -n detectnet_v2_resnet18 -r /home/soundarrajan/detectnet_v2/result/training -e /home/soundarrajan/detectnet_v2/config/detectnet_v2_train_config.txt --log_file /home/soundarrajan/detectnet_v2/logs/training_log.txt

ERROR:

INFO:tensorflow:Graph was finalized.
2022-06-06 12:34:52,779 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2022-06-06 12:34:54,226 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2022-06-06 12:34:54,743 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2022-06-06 12:35:00,359 [INFO] tensorflow: Saving checkpoints for step-0.
2022-06-06 12:35:15.946288: F tensorflow/core/kernels/cuda_solvers.cc:94] Check failed: cusolverDnCreate(&cusolver_dn_handle) == CUSOLVER_STATUS_SUCCESS Failed to create cuSolverDN instance.
[2e033a5e779a:00072] *** Process received signal ***
[2e033a5e779a:00072] Signal: Aborted (6)
[2e033a5e779a:00072] Signal code:  (-6)
[2e033a5e779a:00072] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7f2a55138f10]
[2e033a5e779a:00072] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f2a55138e87]
[2e033a5e779a:00072] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f2a5513a7f1]
[2e033a5e779a:00072] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x82f75b4)[0x7f29cebea5b4]
[2e033a5e779a:00072] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow10CudaSolverC1EPNS_15OpKernelContextE+0x102)[0x7f29cab3d042]
[2e033a5e779a:00072] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18MatrixInverseOpGpuIfE12ComputeAsyncEPNS_15OpKernelContextESt8functionIFvvEE+0x147)[0x7f29ca1f9d27]
[2e033a5e779a:00072] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice12ComputeAsyncEPNS_13AsyncOpKernelEPNS_15OpKernelContextESt8functionIFvvEE+0xeb)[0x7f29c5b0f69b]
[2e033a5e779a:00072] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf9617d)[0x7f29c5b7317d]
[2e033a5e779a:00072] [ 8] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf97c6f)[0x7f29c5b74c6f]
[2e033a5e779a:00072] [ 9] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f29c5c24791]
[2e033a5e779a:00072] [10] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f29c5c21df8]
[2e033a5e779a:00072] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7f2a530236df]
[2e033a5e779a:00072] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f2a54ee26db]
[2e033a5e779a:00072] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f2a5521b61f]
[2e033a5e779a:00072] *** End of error message ***
Aborted (core dumped)

training config file:
detectnet_v2_train_config.txt (11.1 KB)

training detectnet_v2 full log file:
training_log.txt (42.0 KB)

Kindly check and help me train the model successfully.

Could you create a new topic for this, since the original issue is resolved?

Hi @Morganh,

Sure, we can close this ticket.

I have created a new topic for the training core dump error.
