safeContext.cpp (184) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)

Description

I am trying to enqueue an inference task to the ExecutionContext but receive the following error:

safeContext.cpp (184) - Cudnn Error in configure: 7 (CUDNN_STATUS_MAPPING_ERROR)

This happens with both enqueue and enqueueV2.
Synchronous execution (execute()) works without producing an error.
As recommended in the best practices, I deserialize the engine from file; the model is a modified YoloV3 loaded from ONNX.
I only have one engine and one ExecutionContext at the moment, but they don't run on the application's main thread.
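
For reference, the deserialization looks roughly like this (a minimal sketch; gLogger stands in for my ILogger implementation and the engine file name is a placeholder):

#include <NvInfer.h>
#include <fstream>
#include <vector>

// Read the serialized engine from disk.
std::ifstream file("yolov3.engine", std::ios::binary | std::ios::ate);
const std::streamsize size = file.tellg();
file.seekg(0, std::ios::beg);
std::vector<char> blob(size);
file.read(blob.data(), size);

// Deserialize the engine and create a single execution context (TensorRT 7 API).
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
nvinfer1::ICudaEngine* engine =
    runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
nvinfer1::IExecutionContext* context = engine->createExecutionContext();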

Could anyone point me in the right direction?

Environment

TensorRT Version: 7.2.1
GPU Type: GTX 1070
Nvidia Driver Version: 455.38
CUDA Version: 11.1
CUDNN Version: unsure (using an official container image)
Operating System + Version: Ubuntu 18.04
Container: nvcr.io/nvidia/tensorrt:20.10-py3

Working code

// Copy the input to the device-side input binding (index 0).
cudaMemcpyAsync(deviceBuffer[0], p_data,
                context.binding.deviceBuffer[0].getSize(),
                cudaMemcpyHostToDevice, context.stream);

// Run inference synchronously.
if (!context.context->execute(p_batchSize, &deviceBuffer[0])) {
    LOG(ERROR) << "SyncInference failed!";
}

// Copy every output binding (indices 1..N) back to the host.
for (size_t i = 1; i < deviceBuffer.size(); ++i) {
    cudaMemcpyAsync(context.binding.hostBuffer[i].get(), deviceBuffer[i],
                    context.binding.hostBuffer[i].getSize(),
                    cudaMemcpyDeviceToHost, context.stream);
}
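
Not shown above: before the host buffers are read, the queued copies have to complete. A minimal sketch of that step:

// Block until all copies queued on the stream have completed.
cudaStreamSynchronize(context.stream);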

Non-working code

// Copy the input to the device-side input binding (index 0).
cudaMemcpyAsync(deviceBuffer[0], p_data,
                context.binding.deviceBuffer[0].getSize(),
                cudaMemcpyHostToDevice, context.stream);

// Enqueue inference asynchronously on the same stream; this call
// produces the CUDNN_STATUS_MAPPING_ERROR.
if (!context.context->enqueue(p_batchSize, &deviceBuffer[0], context.stream, nullptr)) {
    LOG(ERROR) << "AsyncInference failed!";
}

// Copy every output binding (indices 1..N) back to the host.
for (size_t i = 1; i < deviceBuffer.size(); ++i) {
    cudaMemcpyAsync(context.binding.hostBuffer[i].get(), deviceBuffer[i],
                    context.binding.hostBuffer[i].getSize(),
                    cudaMemcpyDeviceToHost, context.stream);
}
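
To narrow down which call actually fails, the CUDA runtime calls can be wrapped in a check macro so a failing copy surfaces immediately (sketch below; CUDA_CHECK is my own helper, not part of TensorRT or CUDA):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// My own error-check helper: prints the error string and aborts.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t status_ = (call);                                      \
        if (status_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                         cudaGetErrorString(status_), __FILE__, __LINE__); \
            std::abort();                                                  \
        }                                                                  \
    } while (0)

// Example: checked input copy.
CUDA_CHECK(cudaMemcpyAsync(deviceBuffer[0], p_data,
                           context.binding.deviceBuffer[0].getSize(),
                           cudaMemcpyHostToDevice, context.stream));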