• Hardware (RTX 3070Ti / RTX 3090 / RTX 3080Ti / A6000): tested on these 4 GPUs.
• Network Type (MaskRCNN)
• TLT Version: nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
• How to reproduce the issue?
I am facing an issue with the conversion of the mask_rcnn .etlt model to the .engine file.
I trained a model on a 3090 PC, then exported it and converted it to an engine file on the same machine without any problem.
After that, I transferred the exported model (.etlt) to my laptop and tried to convert it to an engine there, but the conversion fails, and there are no useful logs to point me to the cause.
I ran the same command with the same weights on 2 PCs (3090 / A6000) and both converted the model successfully. When I ran the same command with the same weights on 2 laptops (3070Ti / 3080Ti), the conversion failed on both.
I am attaching the conversion command and the logs below:
command:
!tao converter -k nvidia_tlt \
-d 3,832,1344 \
-o generate_detections,mask_fcn_logits/BiasAdd \
-e /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_retrain/export/trt.fp16.engine \
-t fp16 \
-i nchw \
-m 1 \
/workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_retrain/model.step-32400.etlt
Logs on the PCs:
[INFO] [MemUsageChange] Init CUDA: CPU +536, GPU +0, now: CPU 542, GPU 1031 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 848 MiB, GPU 1031 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +750, GPU +318, now: CPU 1669, GPU 1349 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +618, GPU +268, now: CPU 2287, GPU 1617 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 2 output network tensors.
[INFO] Total Host Persistent Memory: 248000
[INFO] Total Device Persistent Memory: 84687872
[INFO] Total Scratch Memory: 53721600
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 162 MiB, GPU 32 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3309, GPU 2113 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3309, GPU 2121 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 3309, GPU 2109 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 3308, GPU 2093 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 3237 MiB, GPU 2093 MiB
2022-05-26 19:21:20,637 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
Logs on the laptops:
[INFO] [MemUsageChange] Init CUDA: CPU +536, GPU +0, now: CPU 542, GPU 417 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 848 MiB, GPU 417 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +749, GPU +318, now: CPU 1669, GPU 735 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +618, GPU +268, now: CPU 2287, GPU 1003 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
2022-05-26 19:24:19,300 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
As shown, the logs don’t really reveal anything. I tried this on different laptops and every attempt led to a similar result.
Things I have tried:
- Made sure the -k key is correct, and that the path to the .etlt model is correct and mapped properly.
- Tried with the -s parameter (same result).
- Adjusted the -w parameter, but it made no difference (the variant I ran is sketched after this list).
- Tried -v for a verbose log, but got an error that there is no argument -v.
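For reference, this is roughly the variant I ran when experimenting with -s and -w; the workspace size value below is only an example of the kind of values I tried, not a recommendation:
# example only: -s and -w added to the original command; 2147483648 bytes (2 GiB) is an illustrative workspace size
!tao converter -k nvidia_tlt \
-d 3,832,1344 \
-o generate_detections,mask_fcn_logits/BiasAdd \
-e /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_retrain/export/trt.fp16.engine \
-t fp16 \
-i nchw \
-m 1 \
-s \
-w 2147483648 \
/workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_retrain/model.step-32400.etlt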
I need to convert this model on the laptops as well for inference, so please help.
PS: I intentionally broke the path to the .etlt model so I could confirm that errors are actually logged. With a wrong .etlt path I get this:
Unsupported number of graph 0
[ERROR] Failed to parse the model, please check the encoding key to make sure it's correct
[ERROR] 4: [network.cpp::validate::2411] Error Code 4: Internal Error (Network must have at least one output)
[ERROR] Unable to create engine
2022-05-26 20:38:03,639 [INFO] tlt.components.docker_handler.docker_handler: Stopping container
But this does not happen with the correct .etlt path, so the key and model path are not the issue either. This feels like a weird problem.