Python crash calling memcpy_htod_async


We are working on a Jetson Xavier NX with Jetpack 4.4.
We are trying to implement a TensorRT engine using Python and then use the whole module as a service from C++.
The Python code loads an existing TensorRT model and then receives a picture from the C++ code and uses it in the model.
We already have a similar setup that uses Python code with a different computing platform (Coral's Edge TPU).
We tested the Python code standalone (no C++), reading a picture from a file, and it worked as expected.
When we integrate the Python code from C++ using Boost.Python, we crash while calling pycuda.driver.memcpy_htod_async, with this printed:

We checked the data format and content, and they are the same in both cases (running standalone Python and running via C++).
Is there a way to understand what this assert means and what we can do about it?


TensorRT Version:
GPU Type: Nvidia CUDA
Nvidia Driver Version: Jetpack 4.4
CUDA Version: 10.2.89
CUDNN Version: 8
Operating System + Version: Ubuntu 18.04 LTS
Python Version (if applicable): 3.6.9
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

Snippet from the Python code:

def __init__(self, model, labels):
    print('Initializing TensorRT engine...')
    # loop over the class labels file
    for row in open(labels):
        # unpack the row and update the labels dictionary
        (classID, label) = row.strip().split(maxsplit=1)
        self.labels[int(classID)] = label.strip()

    TRT_LOGGER = trt.Logger(trt.Logger.INFO)
    trt.init_libnvinfer_plugins(TRT_LOGGER, '')
    self.runtime = trt.Runtime(TRT_LOGGER)
    self.layout = 7  # size of one-detection tuple (index, label, conf, xmin, ymin, xmax, ymax)
    self.height = 300
    self.width = 300

    ### create engine ###
    with open(model, 'rb') as f:
        buf = f.read()
        self.engine = self.runtime.deserialize_cuda_engine(buf)

    ### create buffers ###
    self.host_inputs  = []
    self.cuda_inputs  = []
    self.host_outputs = []
    self.cuda_outputs = []
    self.bindings = []
    self.stream = cuda.Stream()

    for binding in self.engine:
        # page-locked host buffer plus matching device buffer per binding
        size = trt.volume(self.engine.get_binding_shape(binding)) * self.engine.max_batch_size
        host_mem = cuda.pagelocked_empty(size, np.float32)
        cuda_mem = cuda.mem_alloc(host_mem.nbytes)
        self.bindings.append(int(cuda_mem))
        if self.engine.binding_is_input(binding):
            self.host_inputs.append(host_mem)
            self.cuda_inputs.append(cuda_mem)
        else:
            self.host_outputs.append(host_mem)
            self.cuda_outputs.append(cuda_mem)

    self.context = self.engine.create_execution_context()

def eval(self, img, width, height):
    nbytes = width * height * 3  # avoid shadowing the built-in len
    image = np.frombuffer(img, dtype=np.uint8, count=nbytes)
    image = image.reshape(width, height, 3)
    image = (2.0 / 255.0) * image - 1.0
    image = image.transpose((2, 0, 1))

    np.copyto(self.host_inputs[0], image.ravel())

    cuda.memcpy_htod_async(self.cuda_inputs[0], self.host_inputs[0], self.stream)

Hi @ilya9,

Could you please share a complete reproducible script and the model so we can assist better?

Thank you.

We resolved this by using the context properly in a multi-threaded environment.
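For anyone hitting the same assert: when the Python module is driven from C++ threads, each thread that touches CUDA must have the CUDA context made current before calling into PyCUDA. A minimal sketch of the push/pop pattern (the ContextGuard class name is illustrative, not from the original code; in real use you would pass it the context returned by pycuda.driver.Device(0).make_context()):

```python
import threading

class ContextGuard:
    """Make a CUDA context current for the duration of a with-block.

    `ctx` is any object exposing push()/pop() -- in real use, the
    pycuda context returned by cuda.Device(0).make_context(), popped
    once after creation so no thread holds it implicitly.
    """

    def __init__(self, ctx):
        self.ctx = ctx
        # serialize access when several C++ threads call into Python
        self.lock = threading.Lock()

    def __enter__(self):
        self.lock.acquire()
        self.ctx.push()   # bind the context to the calling thread
        return self.ctx

    def __exit__(self, exc_type, exc, tb):
        self.ctx.pop()    # unbind even if the CUDA work raised
        self.lock.release()
        return False
```

With this in place, every CUDA call from eval() runs inside the guard, e.g. `with guard: cuda.memcpy_htod_async(...)`, so memcpy_htod_async always sees a current context regardless of which C++ thread invoked it.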
