[TensorRT 10.x] Is ConvTranspose3d supported in INT8 on Jetson? (QAT Workflow)

Hello everyone,

I am currently working on deploying a 3D Deep Learning model on NVIDIA Jetson. My goal is to maximize performance by running the entire model (or as much as possible) in INT8 precision.

My current workflow is:

  1. Training: PyTorch model training.

  2. Quantization: QAT (Quantization Aware Training) using the PyTorch quantization toolkit / TensorRT Model Optimizer.

  3. Export: Export to ONNX.

  4. Deployment: building the engine with TensorRT 10.13 (for testing only).

The Issue: The quantization works perfectly for most of the model (standard 3D Convolutions), but the ConvTranspose3d layers remain in FP16 (or FP32) in the final engine. They do not seem to run in INT8 despite the QAT calibration.

I have searched through GitHub issues and the forum, but I haven’t found concrete documentation confirming if ConvTranspose3d has INT8 kernel implementations on Jetson (Orin/Xavier) architectures.

Context on the layer:

  • This specific part of the model performs a 16x upsample.

  • I am currently using a strided ConvTranspose3d to achieve this with learnable parameters.

My Questions:

  1. Is ConvTranspose3d supported in INT8 in TensorRT 10.x? Or is the fallback to FP16 expected behavior due to missing kernels?

  2. Are there specific constraints (kernel size, stride, padding) required to trigger the INT8 kernel for 3D Deconvolution?

  3. Architecture alternatives: Since I need a high-quality 16x upsample, if ConvTranspose3d is not hardware-friendly for INT8, would you recommend a different approach in INT8 mode (e.g., Resize (nearest/trilinear) + standard Conv3d, or a 3D PixelShuffle)?

Here is the part of my model with the Deconv3D layers:

import torch.nn as nn
from torch.ao.nn.quantized import FloatFunctional  # or torch.nn.quantized.FloatFunctional on older PyTorch

class Bottleneck3D(nn.Module):

    def __init__(self, c, expansion=2):
        super().__init__()
        mid_c = c * expansion
        self.conv1 = nn.Conv3d(c, mid_c, 1, bias=False)
        self.bn1 = nn.BatchNorm3d(mid_c)
        self.conv2 = nn.Conv3d(mid_c, mid_c, 3, padding=1, bias=False)  # groups=1

        self.bn2 = nn.BatchNorm3d(mid_c)
        self.conv3 = nn.Conv3d(mid_c, c, 1, bias=False)
        self.bn3 = nn.BatchNorm3d(c)
        self.act = nn.ReLU6(inplace=True)

        self.add = FloatFunctional()

    def forward(self, x):
        identity = x
        out = self.act(self.bn1(self.conv1(x)))
        out = self.act(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out = self.add.add(out, identity)
        out = self.act(out)
        return out

class Deconv3DStack(nn.Module):

    def __init__(self, in_c=64, deconv_layers=4, final_c=32):

        super().__init__()
        mid_channels = 256  
        last_channels = 128
        self.block1 = nn.Sequential(  
            nn.ConvTranspose3d(in_c, mid_channels * 2, kernel_size=2, stride=2, bias=False),
            nn.BatchNorm3d(mid_channels * 2),
            nn.ReLU(inplace=True),
            Bottleneck3D(mid_channels * 2),
            Bottleneck3D(mid_channels * 2),
        )
        self.block2 = nn.Sequential( 
            nn.ConvTranspose3d(mid_channels * 2, mid_channels, kernel_size=2, stride=2, bias=False),
            nn.BatchNorm3d(mid_channels),
            nn.ReLU(inplace=True),
            Bottleneck3D(mid_channels),
        )

        self.block3 = nn.Sequential( 
            nn.ConvTranspose3d(mid_channels, last_channels, kernel_size=2, stride=2, bias=False),
            nn.BatchNorm3d(last_channels),
            nn.ReLU(inplace=True),
        )

        self.block4 = nn.Sequential( 
            nn.ConvTranspose3d(last_channels, final_c, kernel_size=2, stride=2, bias=False),
            nn.BatchNorm3d(final_c),
            nn.ReLU(inplace=True),
        )
        self.out_c = final_c

    def forward(self, x):

        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.block4(x)

        return x

Any insights or documentation links would be greatly appreciated!

Thanks.

*** Please note that this reply is generated by LLM automatically ***

To address your question, I’ll break down the key points and provide insights into the deployment of your 3D Deep Learning model on NVIDIA Jetson using PyTorch, QAT, and TensorRT.

PyTorch Quantization Toolkit and QAT

PyTorch provides a quantization toolkit that allows you to perform Quantization-Aware Training (QAT) on your model. QAT is a technique that simulates the effects of quantization during training, allowing the model to adapt to the quantization noise. This is useful for models that are sensitive to quantization noise, such as those with a large number of weights or activations.
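As a minimal illustration of the idea (hypothetical names such as MyModel, train_one_epoch, and num_epochs are placeholders; for a TensorRT deployment you would normally keep the fake-quantized graph and export Q/DQ nodes to ONNX instead of calling convert):

import torch
import torch.ao.quantization as quant

model = MyModel().train()                               # placeholder model
model.qconfig = quant.get_default_qat_qconfig("fbgemm")

# Insert fake-quantization observers so training sees quantization noise.
quant.prepare_qat(model, inplace=True)

for epoch in range(num_epochs):                         # placeholder training loop
    train_one_epoch(model)

# For CPU INT8 inference you would call quant.convert(model.eval());
# for TensorRT you export the fake-quantized model to ONNX with Q/DQ nodes instead.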

TensorRT Model Optimizer and Export to ONNX

After training your model with QAT, you can use the TensorRT Model Optimizer to optimize the model for deployment on NVIDIA hardware. The Model Optimizer takes the trained model and applies various optimizations, such as kernel fusion, constant folding, and dead code elimination, to reduce the computational complexity of the model.

To export the optimized model to ONNX, you can use the torch.onnx.export function. This will create an ONNX model that can be used with TensorRT.
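A minimal export sketch (the file name and the NCDHW input shape are placeholders; opset 13 or newer is required so the Q/DQ nodes produced by QAT can be represented):

import torch

model.eval()
dummy_input = torch.randn(1, 64, 16, 16, 16)   # placeholder NCDHW input

torch.onnx.export(
    model,                       # the fake-quantized model from the QAT step
    dummy_input,
    "model_qat.onnx",
    opset_version=17,            # Q/DQ nodes require opset >= 13
    input_names=["input"],
    output_names=["output"],
)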

INT8 Kernel Implementations on Jetson Orin/Xavier Architectures

TensorRT does provide INT8 kernel implementations for various layers, including 3D convolutions. However, the availability of INT8 kernels for specific layers and architectures depends on the TensorRT version and the NVIDIA hardware.

For Jetson Orin and Xavier architectures, TensorRT 10.x provides INT8 kernel implementations for 3D convolutions, but there are some limitations and constraints.

  • Kernel size: The kernel size must be a power of 2 (e.g., 2, 4, 8, 16).
  • Stride: The stride must be a power of 2 (e.g., 2, 4, 8, 16).
  • Padding: The padding must be symmetric (e.g., padding=(1, 1, 1)).

If your 3D deconvolution layer does not meet these constraints, the fallback to FP16 is expected behavior.
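One practical way to check whether an INT8 kernel is actually available for a given deconvolution is to force its precision at build time and see whether TensorRT can honor it instead of silently falling back. A rough sketch against the TensorRT Python API (config and network are assumed to come from your existing builder/ONNX-parser setup; with a QAT/Q-DQ graph, the placement of the Q/DQ nodes around the deconvolution matters more than these flags):

import tensorrt as trt

config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)
# Fail the build instead of silently falling back when a forced precision
# has no kernel implementation.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type == trt.LayerType.DECONVOLUTION:
        layer.precision = trt.int8
        layer.set_output_type(0, trt.int8)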

3D Deconvolution with Learnable Parameters

In your case, you are using a strided 3D deconvolution with learnable parameters to achieve a 16x upsample. While this is a valid approach, it may not be the most hardware-friendly for INT8.

As an alternative, you could consider using a nearest-neighbor or trilinear upsampling layer, which may be more suitable for INT8. These layers are typically less computationally intensive and may provide better performance on Jetson Orin and Xavier architectures.

Recommendations

Based on your requirements, I recommend the following:

  1. Verify that your 3D deconvolution layer meets the constraints for INT8 kernel implementations on Jetson Orin and Xavier architectures.
  2. If the constraints are not met, consider using a nearest-neighbor or trilinear upsampling layer as an alternative.
  3. If you still want to use the strided 3D deconvolution with learnable parameters, you can let those layers run in FP16 instead of INT8. This preserves accuracy, but it increases memory usage and compute compared to INT8.
  4. Experiment with different approaches and evaluate their performance on your specific use case.

I hope this helps you optimize your 3D Deep Learning model for deployment on NVIDIA Jetson using PyTorch, QAT, and TensorRT.



Thank you for the quick reply.

I have optimized my model and am now at around 50% INT8 coverage. To do this, I simply exported to ONNX and loaded it into NVIDIA Deep Learning Designer, checking it with the INT8 option enabled but without calibration (just the basic settings).

It works, but I know the model can go further. I see in the documentation and forums that calibration can provide much better performance, but I am not sure whether other techniques like PTQ can increase the number of layers running in INT8. I know the model is capable of more than 50% INT8 coverage. However, from what I understand, calibration is deprecated in TensorRT.

Can PTQ help increase the INT8 coverage? For now, I am testing with TensorRT 10.3 on my PC with an RTX 5070.

If PTQ works, I would appreciate any links to official methods.

Hi @utilisateur4351, thanks for the update! Great to hear you have a baseline working.

Well, PTQ won’t increase your INT8 coverage here. The ~50% you’re seeing isn’t a calibration problem. I’ve asked the internal teams whether ConvTranspose3d has INT8 kernel support on Jetson Orin/Xavier and will update you once I hear back.
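To see exactly which layers end up in INT8, you can also dump the per-layer information of the built engine with the engine inspector. A minimal sketch ("model.engine" is a placeholder path, and the engine should be built with config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED so datatype/tactic details are included):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# Dumps one JSON entry per layer; deconvolutions that fall back will show
# FP16/FP32 datatypes instead of Int8.
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))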

Meanwhile, a workaround could be to replace the ConvTranspose3d layers with Upsample + Conv3d:

import torch.nn as nn

class UpsampleBlock(nn.Module):
    def __init__(self, in_c, out_c):
        super().__init__()
        self.block = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='nearest'),  # Fast, no weights
            nn.Conv3d(in_c, out_c, kernel_size=3, stride=1, padding=1, bias=False),  # INT8 supported!
            nn.BatchNorm3d(out_c),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.block(x)

This is the standard workaround for 3D upsampling on edge devices. Swap one block, re-export, and check the logs.
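For instance, swapping the last stage of your Deconv3DStack would look roughly like this (a hypothetical drop-in, reusing your channel names):

# In Deconv3DStack.__init__, replace the ConvTranspose3d-based stage:
# before:
#   self.block4 = nn.Sequential(
#       nn.ConvTranspose3d(last_channels, final_c, kernel_size=2, stride=2, bias=False),
#       nn.BatchNorm3d(final_c),
#       nn.ReLU(inplace=True),
#   )
# after:
self.block4 = UpsampleBlock(last_channels, final_c)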

Thank you.

Thank you for your quick response.
In fact, the problem was not so much the model itself as the QAT phase.

I am using NVIDIA Deep Learning Designer.
When I import my model without calibration, exported with ONNX opset 18, the TensorRT engine reports an inference time of 7.5 ms with the following parameters:
fp16 true
bf16 true
int8 true
All the Conv2D, Conv3D, transpose layers, etc. run in INT8, so that part is fine.
I’m happy with that. I think that if I switch to QAT, I can gain accuracy and perhaps a little more INT8 coverage.
To do this, I use TensorRT Model Optimizer in Python. I followed the example:

I import the QAT ONNX file into Deep Learning Designer with the options
fp16 true
bf16 true
I get an inference time of 9.8 ms.
Why is QAT worse than the run without calibration?
I am using the same model, so I don’t understand.
I can send you my logs for the different steps.
Here is the kind of code used in the QAT phase:

import os
import torch
import modelopt.torch.quantization as mtq
# get_onnx_bytes_and_metadata / OnnxBytes come from the TensorRT Model Optimizer deploy utilities;
# forward_loop_for_calib, calib_loader, CAM_NAMES, IMAGE_SIZE, DEVICE, etc. are my own helpers.

def export_onnx(model, args, epoch):
    print(f"\n====== EXPORT ONNX epoch {epoch+1} ======\n")
    model.eval()

    B = 1
    Cams = len(CAM_NAMES)
    H, W = IMAGE_SIZE
    dummy_imgs = torch.randn(B, Cams, 3, H, W, device=DEVICE)
    dummy_points = torch.randn(B, N_POINTS_FIXED, 3, device=DEVICE)

    onnx_path = os.path.join(args.out_dir, f"surroundocc_qat_int8_epoch_{epoch+1:04d}.onnx")
    model_name = f"surroundocc_qat_epoch_{epoch+1}"
    onnx_bytes, _ = get_onnx_bytes_and_metadata(
        model,
        (dummy_imgs, dummy_points),
        model_name=model_name,
        onnx_opset=18  
    )
    onnx_bytes_obj = OnnxBytes.from_bytes(onnx_bytes)
    onnx_bytes_obj.write_to_disk(os.path.dirname(onnx_path), clean_dir=False)

    print(f"✔ ONNX exporté pour l'epoch {epoch+1} : {onnx_path}")


def _forward_loop(m):
    forward_loop_for_calib(m, calib_loader, DEVICE)

# (excerpt from the training script, after the calibration loop is defined)
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, _forward_loop)
mtq.print_quant_summary(model)  # Before QAT
export_onnx(model, args, 10)
print("✅ Model fake-quantized, ready for INT8 QAT")

# ====== 4. Optimizer ======
opt = torch.optim.AdamW(
    model.parameters(),
    lr=args.lr,
    weight_decay=args.weight_decay
)
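
For reference, the QAT phase itself then fine-tunes the fake-quantized model before the final export, roughly like the sketch below, where train_loader, compute_loss, and args.qat_epochs are simplified placeholders for my real training code:

for epoch in range(args.qat_epochs):                       # placeholder epoch count
    model.train()
    for imgs, points, target in train_loader:              # placeholder data loader
        opt.zero_grad()
        loss = compute_loss(model(imgs, points), target)   # placeholder loss
        loss.backward()
        opt.step()

# Re-export the fine-tuned fake-quantized model with Q/DQ nodes.
export_onnx(model, args, epoch)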