Tlt-converter out of memory

I am running the latest tlt-converter (cuda102-trt71-jp45) with the -w parameter on a Jetson Xavier NX, and I noticed that at higher workspace settings, such as 3-4 GB or more, it fails with a memory error reporting that available memory is 0.

[ERROR] Internal error: plugin node BatchedNMS requires 1029376 bytes of scratch space, but only 0 is available

The issue does not happen with the default value of 1<<30. In the past I encountered the same problem when building models directly in TensorRT: when the input parameter to the program was above the maximum value of a 32-bit integer, setMaxWorkspaceSize received 0. Can you please verify that 64-bit integers are handled correctly in the input parameters of tlt-converter?

Can you share your command and the full log when you run the tlt-converter?

./tlt-converter -t fp16 -d 3,288,512 -k key model.etlt -o BatchedNMS -w 17179869184 -m 1
[INFO] --------------- Layers running on DLA:
[INFO] --------------- Layers running on GPU:
[INFO] conv1/convolution + conv1_mish/Relu6, maxpool_1/MaxPool, <<<TRUNCATED>>> , BatchedNMS,
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 4 output network tensors.
[ERROR] Internal error: plugin node BatchedNMS requires 1029376 bytes of scratch space, but only 0 is available. Try increasing the workspace size with IBuilderConfig::setMaxWorkspaceSize() if using IBuilder::buildEngineWithConfig, or IBuilder::setMaxWorkspaceSize() if using IBuilder::buildCudaEngine.
[ERROR] ../builder/cudnnBuilder2.cpp (1118) - OutOfMemory Error in checkPluginScratchSize: 0
[ERROR] Unable to create engine
Segmentation fault (core dumped)

This was set to 16 GB and ran on an AGX with 32 GB, but the same issue occurs with lower -w values.

Can you try “-w 100000000” or “-w 1000000000” ?
Reference: Tutorial Spec Error: Message type "RegularizerConfig" has no field named "reg_type" - #2 by Morganh
TLT Converter Fails - #2 by Morganh

1000000000 works, but that is not even 1 GB, less than the default (1<<30 = 1073741824), and still within the 32-bit int range. I am still getting the warning:

[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output

In the past we observed that TensorRT models built with higher workspace memory (4 or 8 GB) produced slightly different results at inference.

Please ignore the [INFO] log. Please check if the new etlt file is available in your directory.

Yes, the new engine is there. I already knew it would work at 1 GB; that is not the issue here. The issue is that models generated with different workspace memory settings can differ from each other, and the tool does not work correctly with larger workspaces.

@Morganh, has the issue been verified as I asked?

The -w value cannot be set very large; otherwise, there is no memory space left for other applications.
I will verify your case later.

The default -w value is 1<<30 (i.e., 1 GB). It works for your case.
End users cannot set it to a much higher value.

root@862a17075444:/workspace# tlt-converter -h
usage: tlt-converter [-h] [-v] [-e ENGINE_FILE_PATH]
[-i INPUT_ORDER] [-s] [-u DLA_CORE]

Generate TensorRT engine from exported model

positional arguments:
input_file Input file (.etlt exported model).

required flag arguments:
-d comma separated list of input dimensions(not required for TLT 3.0 new models).
-k model encoding key.

optional flag arguments:
-b calibration batch size (default 8).
-c calibration cache file (default cal.bin).
-e file the engine is saved to (default saved.engine).
-i input dimension ordering – nchw, nhwc, nc (default nchw).
-m maximum TensorRT engine batch size (default 16). If meet with out-of-memory issue, please decrease the batch size accordingly.
-o comma separated list of output node names (default none).
-p comma separated list of optimization profile shapes in the format <input_name>,<min_shape>,<opt_shape>,<max_shape>, where each shape has the format: xxx. Can be specified multiple times if there are multiple input tensors for the model. This argument is only useful in dynamic shape case.
-s TensorRT strict_type_constraints flag for INT8 mode(default false).
-t TensorRT data type – fp32, fp16, int8 (default fp32).
-u Use DLA core N for layers that support DLA(default = -1, which means no DLA core will be utilized for inference. Note that it’ll always allow GPU fallback).
-w maximum workspace size of TensorRT engine (default 1<<30). If meet with out-of-memory issue, please increase the workspace size accordingly.