As Matt explained, we are not able to upgrade TensorRT: this model is destined for a device which has already been through substantial compliance testing on JetPack 5.1, and the customer cannot update to JetPack 6.1 (the only version which provides TensorRT 10). We can confirm that the model does currently execute with TensorRT 10, however. Can you please advise on how we might make this model work with JetPack 5.1 (i.e. TensorRT 8)?
Confirmed that we can reproduce the same issue locally.
Based on the output, we suspect this is related to a known issue that is fixed in JetPack 6.
As we need more time to verify the cause, are you able to run the converter a few more times?
The known issue has a roughly 50% failure rate (it is time-related), so it should be possible to get and serialize a working engine if you try multiple times.
Thanks @AastaLLL - it’s great that you can reproduce. Would you mind pointing us to the relevant statement in the release notes which shows the issue is known and resolved? This will help us build a case with our customer to upgrade to JetPack 6.x in the medium term. In the short term they will not be able to upgrade as they have been through significant compliance testing with JetPack 5.1, so we will need to find a workaround. Do you have any suggestions?
I will ask an engineer this morning to rerun a few times to see if it ever converts.
@AastaLLL We’ve run it around 20 times, and so far it’s failed every time with the same error. Could you perhaps expand on what you mean by “time-related”? I’m going to set it up so that it will run continuously overnight, so we should get several thousand runs, and I’ll let you know if any succeed.
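(In case it is useful to anyone following along, a rough sketch of the overnight loop we are using; it assumes the engine is built with trtexec, and the command, flags, and run count are placeholders for whatever conversion command is actually failing.)

    import subprocess

    # Repeatedly rebuild the engine and count successes/failures overnight.
    # Replace CMD with the actual conversion command being used.
    CMD = ["trtexec", "--onnx=model.onnx", "--saveEngine=model.engine"]

    successes = failures = 0
    for i in range(1000):
        result = subprocess.run(CMD, capture_output=True, text=True)
        if result.returncode == 0:
            successes += 1
            print(f"run {i}: succeeded")
            break  # keep the first working engine
        failures += 1
        print(f"run {i}: failed")

    print(f"{successes} succeeded, {failures} failed")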
Hi @AastaLLL, are you any closer to finding a workaround? Or perhaps do you have some suggestions for things we might try which might affect the outcome so that we can have a look ourselves?
We need more time for this issue, as we are currently short on resources.
To find a workaround (WAR) for this issue, it’s recommended to try our ONNX GraphSurgeon tool.
As your model fails around the ‘/ScatterND_6’ layer, we recommend marking the input and output tensors of ‘/ScatterND_6’ as model outputs.
If the conversion passes without the layer but fails once the layer is added back, you can try to replace or rework that layer to find a way to WAR this issue.
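Something along these lines with ONNX GraphSurgeon should work as a starting point (a sketch only; the node name ‘/ScatterND_6’ and file names are taken from this thread, and the exact edits needed for your model are an assumption):

    import onnx
    import onnx_graphsurgeon as gs

    # Run shape inference so intermediate tensors get dtypes/shapes, which is
    # needed when they are promoted to graph outputs.
    model = onnx.shape_inference.infer_shapes(onnx.load("model.onnx"))
    graph = gs.import_onnx(model)

    # Locate the suspect node reported by TensorRT.
    node = next(n for n in graph.nodes if n.name == "/ScatterND_6")

    # Variant A: expose the node's input and output tensors as graph outputs
    # (keeps the layer in the graph, lets you inspect what feeds it).
    graph.outputs = [t for t in node.inputs if isinstance(t, gs.Variable)] + list(node.outputs)

    # Variant B: to test the conversion *without* the layer, make only its
    # inputs the graph outputs; cleanup() will then prune the ScatterND node.
    # graph.outputs = [t for t in node.inputs if isinstance(t, gs.Variable)]

    graph.cleanup()  # drop anything that no longer contributes to the outputs
    onnx.save(gs.export_onnx(graph), "model_debug.onnx")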
Here is a minimal reproducible example of the operation that causes the error:
import torch


class DummyOp(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        batch = features.size(0)
        n = features.size(1)
        diag = torch.eye(n, device=features.device)
        diag = diag.repeat(batch, 1, 1)
        non_diag_mask = diag == 0
        matrix = torch.zeros_like(diag)
        # Boolean-mask assignment; this is what ends up as the failing ScatterND in the exported graph
        matrix[non_diag_mask] = features.flatten()  # error
        return matrix


def main():
    n = 10
    model = DummyOp()
    data = torch.rand([1, n, n - 1]).to(torch.float32)
    torch.onnx.export(
        model,
        data,
        "model.onnx",
        export_params=True,
        opset_version=11,
        do_constant_folding=True,
        input_names=["input"],
        output_names=["outputs"],
        verbose=True,
    )


if __name__ == "__main__":
    main()
This warning is printed before the error, so it may be a useful hint:
[W] [TRT] Skipping tactic 0x0000000000000000 due to exception Assertion sliceOutDims[i] <= inputDims.d[i] failed.
I managed to get around this by flattening the tensors and indexing with a flat index. However, I think it would be good to know the root cause of this.
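For completeness, a sketch of that workaround applied to the DummyOp example above (the real model uses its own indexing scheme, so treat this as illustrative only; whether it sidesteps the TensorRT issue depends on how the exporter lowers the integer-index assignment):

    import torch


    class DummyOpFlatIndex(torch.nn.Module):
        # Same result as DummyOp, but the off-diagonal entries are written via
        # integer indices into a flattened (batch, n*n) tensor instead of a
        # boolean mask.
        def forward(self, features: torch.Tensor) -> torch.Tensor:
            batch, n = features.size(0), features.size(1)
            idx = torch.arange(n * n, device=features.device)
            # Row-major flat positions of the main diagonal are multiples of n + 1.
            non_diag_idx = idx[idx % (n + 1) != 0]
            flat = torch.zeros(batch, n * n, dtype=features.dtype, device=features.device)
            flat[:, non_diag_idx] = features.reshape(batch, -1)
            return flat.reshape(batch, n, n)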