Description
I have a PyTorch model with a basic transformer architecture. I convert this model into an ONNX model, and then use that ONNX model to build a TensorRT engine (roughly as sketched below). My use case must be compatible with all three model formats.
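For context, the ONNX-to-TensorRT conversion looks roughly like the sketch below; the tensor name, shape ranges, and file paths here are placeholders rather than the exact values in the attached example:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit-batch network
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
# Optimization profile covering the two dynamic axes (batch size, token length);
# the min/opt/max shapes below are illustrative assumptions.
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 1, 300), (1, 30, 300), (8, 512, 300))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("model.trt", "wb") as f:
    f.write(engine_bytes)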
When running on the GPU, I can confirm that all three models produce approximately equal outputs for the same input when run separately. However, when running the ONNX model and the TensorRT model in the same script, I receive the following error during TensorRT inference:
[TRT] [E] 1: [gemmBaseRunner.cpp::executeGemm::468] Error Code 1: Cask (Cask Gemm execution)
which results in an output of all zeros.
I have determined that the issue occurs when the ONNX Runtime session is initialized after my TensorRT model; if it is initialized before, the error does not occur.
At this point, I could refactor my codebase to take special care with the order in which the runtimes are created and the models are initialized, but I am curious whether there is an underlying issue here that might bite me later, and also simply why this happens at all.
One possibly relevant implementation detail (though I don't think it is the cause) is that the model is exported to ONNX with two dynamic axes: the first dimension is the batch size and the second is the token length (export sketched below).
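For reference, the export call looks roughly like this; the tensor names and dummy shape are placeholders rather than exactly what the attached example uses:

# Rough sketch of the ONNX export with the two dynamic axes described above.
# The "input"/"output" names are illustrative assumptions.
dummy_input = torch.randn(1, 30, 300)  # (batch, tokens, features)
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size", 1: "token_length"},
        "output": {0: "batch_size"},
    },
)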
A complete reproducible example is included in the Steps To Reproduce section, though for convenience I will paste what I believe is the most relevant code here.
The wrapper class for the TensorRT engine (built from the ONNX model):
import numpy as np
import pycuda.autoinit  # creates and activates a CUDA context for pycuda (assumed; the attached example may manage the context differently)
import pycuda.driver as cuda
import tensorrt as trt
import torch


class ONNXModelWrapper():
    def __init__(self, file, precision):
        self.precision = precision
        self.load(file)
        self.stream = None

    def load(self, file):
        # Deserialize the prebuilt TensorRT engine and create an execution context
        with open(file, "rb") as f:
            runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()

    def allocate_memory(self, batch):
        # Allocate page-locked host memory for the output and device memory for input/output
        self.output = cuda.pagelocked_empty(
            tuple(trt.Dims((batch.shape[0], 1))),
            dtype=self.precision,
        )
        self.d_output = cuda.mem_alloc(self.output.nbytes)
        self.b_input = batch.nbytes
        self.d_input = cuda.mem_alloc(1 * batch.nbytes)
        self.bindings = [int(self.d_input), int(self.d_output)]
        if self.stream is None:
            self.stream = cuda.Stream()

    def predict(self, batch):
        if isinstance(batch, torch.Tensor):
            batch = batch.detach().cpu().numpy()
        # (Re)allocate buffers if the input size changed
        if self.stream is None or batch.nbytes != self.b_input:
            self.allocate_memory(batch)
        self.context.set_input_shape(self.engine.get_tensor_name(0), batch.shape)
        # Transfer input data to device
        cuda.memcpy_htod_async(
            self.d_input,
            cuda.register_host_memory(np.ascontiguousarray(batch.ravel())),
            self.stream,
        )
        self.context.set_tensor_address(self.engine.get_tensor_name(0), self.d_input)
        self.context.set_tensor_address(self.engine.get_tensor_name(1), self.d_output)
        # Execute model
        assert self.context.all_binding_shapes_specified
        self.context.execute_async_v3(self.stream.handle)
        # Synchronize the stream
        self.stream.synchronize()
        # Transfer predictions back
        cuda.memcpy_dtoh_async(self.output, self.d_output, self.stream)
        self.stream.synchronize()
        return self.output

    def __call__(self, batch):
        return self.predict(batch)
The problem arises during initialization and inference:
trt_model = ONNXModelWrapper(trt_file, np.float32)
ort_session = onnxruntime.InferenceSession(
    onnx_path,
    providers=["CUDAExecutionProvider"],
)
sample_input = torch.randn(1, 30, 300)
ort_inputs = {
    ort_session.get_inputs()[0].name: sample_input.numpy(),
}
ort_outs = ort_session.run(None, ort_inputs)
trt_outs = trt_model(sample_input.numpy())
If the first two lines are swapped (that is, if the ONNX Runtime session is created before my TensorRT model is initialized), this code works as intended; otherwise, I receive the GEMM error.
Also note that if the ONNX Runtime session is created with the CPU execution provider, the error does not occur with either ordering. Both working variants are sketched below.
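For completeness, these are the two variants that run cleanly on my setup (the same code as above; only the initialization order or the provider changes):

# Variant 1: create the ONNX Runtime session before the TensorRT wrapper.
ort_session = onnxruntime.InferenceSession(
    onnx_path,
    providers=["CUDAExecutionProvider"],
)
trt_model = ONNXModelWrapper(trt_file, np.float32)

# Variant 2: keep the original order, but run ONNX Runtime on the CPU.
trt_model = ONNXModelWrapper(trt_file, np.float32)
ort_session = onnxruntime.InferenceSession(
    onnx_path,
    providers=["CPUExecutionProvider"],
)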
What might be the issue here? Is there a memory allocation flaw in my code?
Environment
TensorRT Version: 10.0.1
GPU Type: Tesla T4
Nvidia Driver Version: 555.42.06
CUDA Version: 12.4
CUDNN Version: 8.9.2
Operating System + Version: Linux (Ubuntu 20.04.1)
Python Version (if applicable): 3.10.13
TensorFlow Version (if applicable): N/A
PyTorch Version (if applicable): 1.13.1
Baremetal or Container (if container which image + tag): Baremetal
Steps To Reproduce
I have attached a reproducible example here, along with a requirements file listing the relevant dependencies.
trt-onnx-test.zip (2.8 KB)
It is not strictly a minimal reproducible example, since it replicates the exact (simple) model I am using, but it is fairly barebones.
Note that no Python exception is actually raised; the script runs to completion, but TensorRT logs the error above and the TensorRT model produces unintended output (all zeros).