TRT engine returns NaN on Jetson Nano

Hi! I'm trying to implement inference with a Citrinet model on a Jetson Nano. After exporting the model NeMo → ONNX → TensorRT, the output of the TRT engine on an x86 PC doesn't match the output on the Jetson Nano.
Both TRT engines were built with the same precision and workspace size.
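
For context, a minimal sketch of how such an engine is typically built with the TensorRT 8.x Python API; the file names are hypothetical and the optimization profile matches the bindings listed below (this is an assumption, not my actual trt_builder.py):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("citrinet.onnx", "rb") as f:       # hypothetical file name
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30          # same workspace on both machines
config.set_flag(trt.BuilderFlag.FP16)        # same precision on both machines

# dynamic-shape profile for the 'audio_signal' input (the 'length' input is static)
profile = builder.create_optimization_profile()
profile.set_shape("audio_signal", min=(1, 80, 10), opt=(1, 80, 150), max=(1, 80, 300))
config.add_optimization_profile(profile)

engine = builder.build_engine(network, config)
with open("citrinet.engine", "wb") as f:     # hypothetical file name
    f.write(engine.serialize())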
My model's bindings:

binding 0 - 'audio_signal'
   input:    True
   shape:    (1, 80, -1)
   dtype:    DataType.FLOAT
   size:     -320
   dynamic:  True
   profiles: [{'min': (1, 80, 10), 'opt': (1, 80, 150), 'max': (1, 80, 300)}]


binding 1 - 'length'
   input:    True
   shape:    (1,)
   dtype:    DataType.INT32
   size:     4
   dynamic:  False
   profiles: [{'min': (1,), 'opt': (1,), 'max': (1,)}]


binding 2 - 'logprobs'
   input:    False
   shape:    (1, -1, 1025)
   dtype:    DataType.FLOAT
   size:     -4100
   dynamic:  True
   profiles: []
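
A sketch of how a binding dump like the one above can be produced with the TensorRT 8.x Python API (the engine path is hypothetical):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("citrinet.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

for i in range(engine.num_bindings):
    shape = tuple(engine.get_binding_shape(i))       # -1 marks a dynamic dimension
    print(f"binding {i} - '{engine.get_binding_name(i)}'")
    print(f"   input:   {engine.binding_is_input(i)}")
    print(f"   shape:   {shape}")
    print(f"   dtype:   {engine.get_binding_dtype(i)}")
    if engine.binding_is_input(i) and -1 in shape:
        # (min, opt, max) shapes from the first optimization profile
        print(f"   profiles: {engine.get_profile_shape(0, i)}")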

When I feed these tensors to the input bindings:

torch.Size([1, 80, 151]) - audio_signal
tensor([[[-0.3492, -0.3492, -0.3492,  ...,  4.5867,  4.7051,  0.0000],
         [-0.3971, -0.3971, -0.3971,  ...,  2.7422,  2.7433,  0.0000],
         [-0.3963, -0.3963, -0.3963,  ...,  2.1081,  2.1132,  0.0000],
         ...,
         [-0.4859, -0.4859, -0.4859,  ...,  2.3163,  1.9124,  0.0000],
         [-0.4717, -0.4717, -0.4717,  ...,  2.5303,  1.0611,  0.0000],
         [-0.4861, -0.4861, -0.4861,  ...,  1.7891,  1.3740,  0.0000]]])
torch.Size([1]) - length
tensor([150])

the output on the x86 PC is:

[[2.43806308e-08 1.67615531e-06 1.31286612e-07 ... 2.07928537e-08
  4.35268888e-08 9.99820173e-01]
 [2.12243876e-08 1.40326529e-06 1.04342654e-07 ... 2.41917171e-08
  4.14337720e-08 9.99853849e-01]
 [2.73860437e-08 1.28268800e-06 1.74102283e-07 ... 4.25407798e-08
  5.49981323e-08 9.99853611e-01]
 ...
 [9.64823510e-09 2.46931700e-06 1.78903429e-06 ... 7.19314357e-07
  9.83519470e-08 9.98588264e-01]
 [8.09493415e-08 1.92733733e-05 5.31556134e-06 ... 4.09953236e-06
  1.16596459e-06 9.96734917e-01]
 [9.93186688e-08 1.32259365e-05 1.55816360e-06 ... 8.91289631e-07
  1.00996351e-06 9.99091506e-01]]

but when I run the same code on the Jetson Nano, I always get NaN:

[[nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 ...
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]
 [nan nan nan ... nan nan nan]]

trt_builder.py is identical on the PC and the Jetson Nano. After building the TRT engines, the PC engine is 333 MB (FP16), while the Jetson Nano engine is 1104 MB (FP16).
I really have no idea why the Jetson Nano engine doesn't work as expected.
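
For reference, the failure can be confirmed programmatically by checking the Jetson output for NaNs and comparing saved outputs from both machines elementwise (this snippet is not from my pipeline; the .npy file names are hypothetical):

import numpy as np

logits_pc = np.load("logprobs_pc.npy")
logits_jetson = np.load("logprobs_jetson.npy")

assert not np.isnan(logits_jetson).any(), "Jetson engine produced NaNs"
# for matching FP16 engines the outputs should agree within a loose tolerance
np.testing.assert_allclose(logits_pc, logits_jetson, rtol=1e-2, atol=1e-4)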

A relevant piece of code:

# wav to spectrogram
preprocessed_signal, audio_length = self.preprocessor(
    input_signal=torch.as_tensor(self.buffer, dtype=torch.float32).unsqueeze(dim=0),
    length=torch.as_tensor(self.buffer.size, dtype=torch.int32).unsqueeze(dim=0)
)
logits = self.model.execute((torch_to_numpy(preprocessed_signal), torch_to_numpy(audio_length)))
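
(torch_to_numpy is not shown in the post; presumably it is something like the following.)

# Assumed implementation of the torch_to_numpy helper used above: detach the
# tensor from the autograd graph, move it to the CPU, and convert it.
def torch_to_numpy(tensor):
    return tensor.detach().cpu().numpy()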

The execute function from the TRTmodel class:

    def execute(self, inputs, sync=True, return_dict=False, **kwargs):
        """
        Run the DNN model in TensorRT.  The inputs are provided as numpy arrays in a list/tuple/dict.
        Note that execute() doesn't perform any pre/post-processing - this is typically done in subclasses.
        
        Parameters:
          inputs (array, list[array], dict[array]) -- the network inputs as numpy array(s).
                         If there is only one input, it can be provided as a single numpy array.
                         If there are multiple inputs, they can be provided as numpy arrays in a
                         list, tuple, or dict.  Inputs in lists and tuples are assumed to be in the
                         same order as the input bindings.  Inputs in dicts should have keys with the
                         same names as the input bindings.
          sync (bool) -- If True (default), will wait for the GPU to be done processing before returning.
          return_dict (bool) -- If True, the results will be returned in a dict of numpy arrays, where the
                                keys are the names of the output binding names. By default, the results will 
                                be returned in a list of numpy arrays, in the same order as the output bindings.
          
        Returns the model output as a numpy array (if only one output), list[ndarray], or dict[ndarray].
        """
        if isinstance(inputs, np.ndarray):
            inputs = [inputs]
        
        assert len(inputs) == len(self.inputs)
        
        # setup inputs + copy to GPU
        def setup_binding(binding, input):
            input = input.astype(trt.nptype(binding.dtype), copy=False)
            if binding.dynamic: 
                binding.set_shape(input.shape)
            cuda.memcpy_htod_async(binding.device, np.ascontiguousarray(input), self.stream)
            
        if isinstance(inputs, (list,tuple)):
            for idx, input in enumerate(inputs):
                setup_binding(self.bindings[idx], input)
        elif isinstance(inputs, dict):        
            for binding_name in inputs:
                setup_binding(self.find_binding(binding_name), inputs[binding_name])
        else:
            raise ValueError(f"inputs must be a list, tuple, or dict (instead got type '{type(inputs).__name__}')")
            
        assert self.trt_context.all_binding_shapes_specified
        assert self.trt_context.all_shape_inputs_specified 
        
        # query new dynamic output shapes
        for output in self.outputs:
            output.query_shape()

        # run inference
        self.trt_context.execute_async_v2(
            bindings=[int(binding.device) for binding in self.bindings], 
            stream_handle=self.stream.handle
        )
          
        # copy outputs to CPU
        for output in self.outputs:
            cuda.memcpy_dtoh_async(output.host, output.device, self.stream)
          
        # wait for completion
        if sync:
            self.stream.synchronize()
            
        # return results
        if return_dict:
            results = {}
            for output in self.outputs:
                results[output.name] = output.host
            return results
        else:
            if len(self.outputs) == 1:
                return self.outputs[0].host
            else:
                return tuple([output.host for output in self.outputs])
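
Hypothetical usage of the class above (the constructor signature and engine path are assumptions): load the engine, then pass the inputs as numpy arrays in binding order.

import numpy as np

model = TRTmodel("citrinet.engine")              # hypothetical engine path

audio_signal = np.random.randn(1, 80, 151).astype(np.float32)
length = np.array([151], dtype=np.int32)

logprobs = model.execute((audio_signal, length))   # -> ndarray of shape (1, T, 1025)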

My JetPack version:

Package: nvidia-jetpack
Version: 4.6-b199
Architecture: arm64
Maintainer: NVIDIA Corporation
Installed-Size: 194
Depends: nvidia-cuda (= 4.6-b199), nvidia-opencv (= 4.6-b199), nvidia-cudnn8 (= 4.6-b199), nvidia-tensorrt (= 4.6-b199), nvidia-visionworks (= 4.6-b199), nvidia-container (= 4.6-b199), nvidia-vpi (= 4.6-b199), nvidia-l4t-jetson-multimedia-api (>> 32.6-0), nvidia-l4t-jetson-multimedia-api (<< 32.7-0)
Homepage: http://developer.nvidia.com/jetson
Priority: standard
Section: metapackages
Filename: pool/main/n/nvidia-jetpack/nvidia-jetpack_4.6-b199_arm64.deb
Size: 29368
SHA256: 69df11e22e2c8406fe281fe6fc27c7d40a13ed668e508a592a6785d40ea71669
SHA1: 5c678b8762acc54f85b4334f92d9bb084858907a
MD5sum: 1b96cd72f2a434e887f98912061d8cfb
Description: NVIDIA Jetpack Meta Package
Description-md5: ad1462289bdbc54909ae109d1d32c0a8

Package: nvidia-jetpack
Version: 4.6-b197
Architecture: arm64
Maintainer: NVIDIA Corporation
Installed-Size: 194
Depends: nvidia-cuda (= 4.6-b197), nvidia-opencv (= 4.6-b197), nvidia-cudnn8 (= 4.6-b197), nvidia-tensorrt (= 4.6-b197), nvidia-visionworks (= 4.6-b197), nvidia-container (= 4.6-b197), nvidia-vpi (= 4.6-b197), nvidia-l4t-jetson-multimedia-api (>> 32.6-0), nvidia-l4t-jetson-multimedia-api (<< 32.7-0)
Homepage: http://developer.nvidia.com/jetson
Priority: standard
Section: metapackages
Filename: pool/main/n/nvidia-jetpack/nvidia-jetpack_4.6-b197_arm64.deb
Size: 29356
SHA256: 104cd0c1efefe5865753ec9b0b148a534ffdcc9bae525637c7532b309ed44aa0
SHA1: 8cca8b9ebb21feafbbd20c2984bd9b329a202624
MD5sum: 463d4303429f163b97207827965e8fe0
Description: NVIDIA Jetpack Meta Package
Description-md5: ad1462289bdbc54909ae109d1d32c0a8

Both setups (PC and Jetson) run in Docker containers.

Dear @kurkovpavel,
I assume both host and target have the same TRT version.
Is it possible to check with the latest JetPack release? If you notice the same issue with the latest release, please share the model and complete repro steps on x86 and the target.

As far as I understand, L4T R32.6.1 is the latest release for the Jetson Nano; my TensorRT version is 8.0.1.6. I'm not sure if there is a newer TensorRT binary for aarch64; anyway, I'll try to build from source… On x86 I tried 7.2.1.6 and 8.0.1.6 - both versions work fine with my engine. If I use onnxruntime as the inference engine, my code on the Jetson Nano works as expected too (see the sketch below).
I'll let you know once I update TensorRT. Thank you.
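
The onnxruntime cross-check mentioned above might look like this (the model path and input names are assumptions based on the bindings earlier in the thread):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("citrinet.onnx")     # hypothetical model path
logprobs = sess.run(
    None,
    {
        "audio_signal": np.random.randn(1, 80, 151).astype(np.float32),
        "length": np.array([151], dtype=np.int32),
    },
)[0]
print(np.isnan(logprobs).any())   # False on the Jetson Nano, per the post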

Dear @kurkovpavel,
As far as I understand, L4T R32.6.1 is the latest release for the Jetson Nano

Please check with JetPack 4.6.3.

I upgraded JetPack; now I have the following environment versions:

NVIDIA NVIDIA Jetson Nano Developer Kit
 L4T 32.7.3 [ JetPack 4.6.3 ]
   Ubuntu 18.04.6 LTS
   Kernel Version: 4.9.253-tegra
 CUDA 10.2.300
   CUDA Architecture: 5.3
 OpenCV version: 4.1.1
   OpenCV Cuda: NO
 CUDNN: 8.2.1.32
 TensorRT: 8.2.1.9
 Vision Works: 1.6.0.501
 VPI: 1.2.3
 Vulkan: 1.2.70

I built the TRT engine from the ONNX model, and now it works on the Jetson Nano!

torch.Size([1, 80, 151]) - audio_signal
tensor([[[-0.3477, -0.3477, -0.3477,  ...,  4.0284,  4.1334,  5.7149],
         [-0.4060, -0.4060, -0.4060,  ...,  2.6393,  2.6403,  3.1195],
         [-0.4058, -0.4058, -0.4058,  ...,  2.0535,  2.0585,  2.5035],
         ...,
         [-0.4913, -0.4913, -0.4913,  ...,  2.3158,  1.9111,  0.6888],
         [-0.4791, -0.4791, -0.4791,  ...,  2.5169,  1.0506,  1.2555],
         [-0.4929, -0.4929, -0.4929,  ...,  1.7210,  1.3171,  2.9772]]])
torch.Size([1]) - length
tensor([151])

TRT logprobs:
[[2.56757566e-08 1.74633362e-06 1.39345090e-07 ... 2.22639578e-08
  4.72480082e-08 9.99820530e-01]
 [2.23299459e-08 1.47347919e-06 1.12848454e-07 ... 2.53773820e-08
  4.38267804e-08 9.99850273e-01]
 [2.95244362e-08 1.38419557e-06 1.81833116e-07 ... 4.75267399e-08
  5.88024207e-08 9.99843001e-01]
 ...
 [1.09524088e-08 2.64874325e-06 1.86362377e-06 ... 7.11508562e-07
  1.21028393e-07 9.98419762e-01]
 [8.64975220e-08 1.92711759e-05 5.07156528e-06 ... 4.05922992e-06
  1.32947264e-06 9.96621251e-01]
 [1.06756914e-07 1.36985691e-05 1.59037688e-06 ... 9.70281008e-07
  1.15787702e-06 9.99046862e-01]]

The Jetson output is similar to the x86 output now.
Thank you for your support. I think JetPack 4.6.3 should be included in the SD card image for the Jetson Nano.

