Exit code zero on error

• Hardware (RTX 2080)
• Network Type: Detectnet_v2
• TLT Version v3.0-py3
• How to reproduce the issue ?
The command inside the container fails, but tlt does not forward the container's exit code; the host process still exits with 0. This is not specific to detectnet_v2.

$ tlt detectnet_v2 export adsfg
2021-08-20 10:27:13,451 [INFO] root: Registry: ['nvcr.io']
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-ee0f19zj because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
usage: detectnet_v2 export [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS]
                           [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp]
                           [--log_file LOG_FILE] -m MODEL -k KEY
                           [-o OUTPUT_FILE] [--force_ptq]
                           [--cal_data_file CAL_DATA_FILE]
                           [--cal_image_dir CAL_IMAGE_DIR]
                           [--data_type {fp32,fp16,int8}] [-s]
                           [--gen_ds_config] [--cal_cache_file CAL_CACHE_FILE]
                           [--batches BATCHES]
                           [--max_workspace_size MAX_WORKSPACE_SIZE]
                           [--max_batch_size MAX_BATCH_SIZE]
                           [--batch_size BATCH_SIZE] [-e EXPERIMENT_SPEC]
                           [--engine_file ENGINE_FILE]
                           [--static_batch_size STATIC_BATCH_SIZE] [-v]
                           {calibration_tensorfile,dataset_convert,evaluate,export,inference,prune,train}
                           ...
detectnet_v2 export: error: invalid choice: 'adsfg' (choose from 'calibration_tensorfile', 'dataset_convert', 'evaluate', 'export', 'inference', 'prune', 'train')
2021-08-20 10:27:23,086 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

$ echo $?
0

Please log in to the docker container and run the command directly.
See the logs below.

$ tlt detectnet_v2 run /bin/bash
2021-08-21 22:56:33,225 [INFO] root: Registry: ['nvcr.io']
2021-08-21 22:56:37,335 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
root@6652cb2bc469:/workspace# detectnet_v2 export -h
Using TensorFlow backend.
usage: detectnet_v2 export [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS]
[--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp]
[--log_file LOG_FILE] -m MODEL -k KEY
[-o OUTPUT_FILE] [--force_ptq]
[--cal_data_file CAL_DATA_FILE]
[--cal_image_dir CAL_IMAGE_DIR]
[--data_type {fp32,fp16,int8}] [-s]
[--gen_ds_config] [--cal_cache_file CAL_CACHE_FILE]
[--batches BATCHES]
[--max_workspace_size MAX_WORKSPACE_SIZE]
[--max_batch_size MAX_BATCH_SIZE]
[--batch_size BATCH_SIZE] [-e EXPERIMENT_SPEC]
[--engine_file ENGINE_FILE]
[--static_batch_size STATIC_BATCH_SIZE] [-v]
{calibration_tensorfile,dataset_convert,evaluate,export,inference,prune,train}

optional arguments:
-h, --help show this help message and exit
--num_processes NUM_PROCESSES, -np NUM_PROCESSES
The number of horovod child processes to be spawned.
Default is -1(equal to --gpus).
--gpus GPUS The number of GPUs to be used for the job.
--gpu_index GPU_INDEX [GPU_INDEX ...]
The indices of the GPU's to be used.
--use_amp Flag to enable Auto Mixed Precision.
--log_file LOG_FILE Path to the output log file.
-m MODEL, --model MODEL
Path to the model file.
-k KEY, --key KEY Key to load the model.
-o OUTPUT_FILE, --output_file OUTPUT_FILE
Output file (defaults to $(input_filename).etlt)
--force_ptq Flag to force post training quantization for QAT
models.
--cal_data_file CAL_DATA_FILE
Tensorfile to run calibration for int8 optimization.
--cal_image_dir CAL_IMAGE_DIR
Directory of images to run int8 calibration if data
file is unavailable
--data_type {fp32,fp16,int8}
Data type for the TensorRT export.
-s, --strict_type_constraints
Apply TensorRT strict_type_constraints or not for INT8
mode.
--gen_ds_config Generate a template DeepStream related configuration
elements. This config file is NOT a complete
configuration file and requires the user to update the
sample config files in DeepStream with the parameters
generated from here.
--cal_cache_file CAL_CACHE_FILE
Calibration cache file to write to.
--batches BATCHES Number of batches to calibrate over.
--max_workspace_size MAX_WORKSPACE_SIZE
Max size of workspace to be set for TensorRT engine
builder.
--max_batch_size MAX_BATCH_SIZE
Max batch size for TensorRT engine builder.
--batch_size BATCH_SIZE
Number of images per batch.
-e EXPERIMENT_SPEC, --experiment_spec EXPERIMENT_SPEC
Path to the experiment spec file.
--engine_file ENGINE_FILE
Path to the exported TRT engine.
--static_batch_size STATIC_BATCH_SIZE
Set a static batch size for exported etlt model.
Default is -1(dynamic batch size).
-v, --verbose Verbosity of the logger.

tasks:
{calibration_tensorfile,dataset_convert,evaluate,export,inference,prune,train}
root@6652cb2bc469:/workspace# echo $?
0
root@6652cb2bc469:/workspace# detectnet_v2 export abcd
Using TensorFlow backend.
usage: detectnet_v2 export [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS]
[--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp]
[--log_file LOG_FILE] -m MODEL -k KEY
[-o OUTPUT_FILE] [--force_ptq]
[--cal_data_file CAL_DATA_FILE]
[--cal_image_dir CAL_IMAGE_DIR]
[--data_type {fp32,fp16,int8}] [-s]
[--gen_ds_config] [--cal_cache_file CAL_CACHE_FILE]
[--batches BATCHES]
[--max_workspace_size MAX_WORKSPACE_SIZE]
[--max_batch_size MAX_BATCH_SIZE]
[--batch_size BATCH_SIZE] [-e EXPERIMENT_SPEC]
[--engine_file ENGINE_FILE]
[--static_batch_size STATIC_BATCH_SIZE] [-v]
{calibration_tensorfile,dataset_convert,evaluate,export,inference,prune,train}

detectnet_v2 export: error: invalid choice: 'abcd' (choose from 'calibration_tensorfile', 'dataset_convert', 'evaluate', 'export', 'inference', 'prune', 'train')
root@6652cb2bc469:/workspace# echo $?
2
root@6652cb2bc469:/workspace#

Hi @Morganh

Indeed, as you suggest, the exit code of the command run inside the docker container is non-zero on failure, as expected.

The problem is that the tlt command itself (the Python package) does not propagate that exit code to the host shell.

# tlt.components.docker_handler.docker_handler:DockerHandler.run_container
        try:
            subprocess.check_call(
                formatted_command,
                shell=True,
                stdout=sys.stdout,
                env=os.environ
            )
        except subprocess.CalledProcessError as e:
            if e.output is not None:
                print("TLT command run failed with error: {}".format(e.output))
                if self._container:
                    logger.info("Stopping container post instantiation")
                    self.stop_container()
                sys.exit(-1)
        # THIS IS NOT HANDLED: subprocess.CalledProcessError when e.output is None.
        # It may be enough to unindent the sys.exit(-1) one level, out of the
        # `if e.output is not None:` block.
        finally:
            if self._container:
                logger.info("Stopping container.")
                self.stop_container()
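The guard `if e.output is not None:` can never fire in this code path: `check_call` does not capture the child's output, so the raised `CalledProcessError` always carries `output=None`. A minimal sketch (plain Python, independent of tlt) demonstrating this:

```python
import subprocess

# check_call does not capture the child's output, so on failure the
# raised CalledProcessError has output=None -- a guard like
# `if e.output is not None:` is therefore never entered.
try:
    subprocess.check_call("exit 2", shell=True)
except subprocess.CalledProcessError as e:
    print(e.output)      # None
    print(e.returncode)  # 2 -- the real exit code is still available
```

Note that `e.returncode` still holds the container command's real status, so the information needed to exit correctly is available even though `e.output` is not.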

You can use the below way instead of "echo $?". Modify docker_handler.py as below.

$ vim venv_3.0/lib/python3.6/site-packages/tlt/components/docker_handler/docker_handler.py

        except subprocess.CalledProcessError as e:
            print("{}".format(e))
            if e.output is not None:

Then, there is below result.

ssd export: error: invalid choice: 'adsfg' (choose from 'dataset_convert', 'evaluate', 'export', 'inference', 'prune', 'train')
Command 'bash -c 'docker exec -it 84b3ffc1aa31713d3f83400cf1f81ede9c902586c15bbb338673c7ffe7f99c78 ssd export adsfg'' returned non-zero exit status 2.
2021-08-30 00:43:52,731 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Hi, thanks again for your feedback.
We are using tlt as part of a pipeline, so the exit code is as important as stdout/stderr - otherwise the next steps would continue running after a failure.

For the moment we are using this (equivalent to your suggestion, but including the exit code):

        except subprocess.CalledProcessError as e:
            msg = "TLT command run failed "
            if e.output is not None:
                msg += "with error: {}".format(e.output)
            else:
                msg += "without output"
            print(msg)
            if self._container:
                logger.info("Stopping container post instantiation")
                self.stop_container()
            sys.exit(-1)

Note we're changing the logic here: previously sys.exit(-1) was not called when e.output was None, but now it always is.
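Going one step further, the container's real status could be forwarded instead of a generic -1. A hedged sketch of the idea (the helper name is ours, not part of tlt):

```python
import subprocess

def run_and_get_status(cmd):
    # Hypothetical helper, not part of the tlt package: run a shell
    # command and return its exit status so the caller can sys.exit()
    # with it instead of a hard-coded -1.
    try:
        subprocess.check_call(cmd, shell=True)
        return 0
    except subprocess.CalledProcessError as e:
        print("TLT command run failed: {}".format(e))
        return e.returncode  # forward the real exit code

print(run_and_get_status("exit 2"))  # 2
```

With this, a pipeline step sees exit status 2 (the argparse error code) rather than an opaque 255 from sys.exit(-1).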

Is the current behavior the intended one? Is there a scenario where exiting with 0 despite a container failure is expected?

I think you can go ahead with your modification. I do not find such a scenario.