Error while moving data from CUDA-capable device to host memory - Error Code 1: Cuda Runtime (unspecified launch failure)

I am following this tutorial to speed up object detection inference on my NVIDIA Jetson Nano.
I have done the following steps:

  1. Converted the .onnx model to a .plan (engine) file and saved it to disk.

  2. Loaded the engine using the following code:

with open(plan_path, 'rb') as f:
    engine_data = f.read()
engine = trt_runtime.deserialize_cuda_engine(engine_data)

  3. Allocated buffers on the host (CPU memory) and on the device (GPU) for the input and output data:

h_input_1 = cuda.pagelocked_empty(batch_size *
trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype($

print('size hinput', h_input_1.size)
h_output = cuda.pagelocked_empty(batch_size *
trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype($
print(trt.volume(engine.get_binding_shape(1)), "trt.volume(engine.get_bindin$
print('size h_output', h_output.size)

d_input_1 = cuda.mem_alloc(h_input_1.nbytes)
print( h_input_1.nbytes)

d_output = cuda.mem_alloc(h_output.nbytes)
print(d_output, h_output.nbytes)

stream = cuda.Stream()
return h_input_1, d_input_1, h_output, d_output, stream

  4. Copied the input data (image) from host memory to device memory using cuda.memcpy_htod_async(d_input_1, h_input_1, stream). This completes successfully.
  5. Ran the object detection network on the loaded input data.
  6. Moved the output data from device memory to host memory using cuda.memcpy_dtoh_async(h_output, d_output, stream).
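
As a sanity check on the buffer allocation, the byte size of each binding can be worked out by hand and compared with the printed nbytes values. This is only a minimal sketch: binding_nbytes is a hypothetical helper (not part of TensorRT or PyCUDA), and the shape and item size in the example are illustrative assumptions, not values read from the engine.

```python
from functools import reduce
import operator


def binding_nbytes(shape, itemsize, batch_size=1):
    """Return batch_size * volume(shape) * itemsize, in bytes.

    Hypothetical helper for checking buffer sizes by hand.
    """
    # A negative dimension marks a dynamic shape; its volume is not meaningful.
    if any(dim < 0 for dim in shape):
        raise ValueError("dynamic dimension in shape %r" % (tuple(shape),))
    volume = reduce(operator.mul, shape, 1)
    return batch_size * volume * itemsize


# Assumed example: a (3, 300, 300) float32 binding with batch_size = 1.
print(binding_nbytes((3, 300, 300), 4))  # 1080000
```

If the hand-computed size differs from the nbytes of the matching host buffer, the allocation is worth a second look before debugging the copies.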

I am facing an issue in step 6, as shown in the following image.

Moreover, I can successfully run inference with the detectnet executable using the following command line:

detectnet --model=models/person/ssd-mobilenet.onnx --labels=models/person/labels.txt \
  --input-blob=input_0 --output-cvg=scores --output-bbox=boxes \
  "$IMAGES/person/000001.jpg" $IMAGES/test/person/000001.jpg

I believe (but am not sure) that it creates a runtime_engine to do the inference. I can also see in jtop that it uses the GPU while it runs.

I have the following questions:

  1. What is this CUDA error, why does it occur, and how can I solve it?
  2. As shown above, I can also run inference with the detectnet command. Which of the two should be preferred for a real-time application, and what is the correct way of using it?
  3. What is the difference between the two ways?


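The error indicates that the GPU task has not finished yet.
Please add a synchronization call (for example, stream.synchronize() in PyCUDA) after the asynchronous calls to make sure the job has finished before the output is read.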
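
A minimal sketch of where the synchronization could go, not a definitive implementation. It assumes that cuda is the pycuda.driver module, that context was created with engine.create_execution_context(), that the engine uses explicit batch (so execute_async_v2 applies), and that binding 0 is the input and binding 1 the output, as in the question. run_inference is a hypothetical name; the cuda module is passed in as an argument only so the function can be exercised without a GPU.

```python
def run_inference(cuda, context, h_input, d_input, h_output, d_output, stream):
    """Copy in, infer, copy out, then wait for the stream to finish.

    Hypothetical helper: `cuda` is assumed to be pycuda.driver and
    `context` a TensorRT IExecutionContext.
    """
    # Queue the host-to-device copy of the input on the stream.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Queue inference on the same stream (explicit-batch engine assumed;
    # binding 0 = input, binding 1 = output).
    context.execute_async_v2(
        bindings=[int(d_input), int(d_output)],
        stream_handle=stream.handle,
    )
    # Queue the device-to-host copy of the output.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    # Block until all work queued on the stream has completed;
    # h_output should only be read after this call returns.
    stream.synchronize()
    return h_output
```

All three asynchronous calls only enqueue work, so without the final stream.synchronize() the host may read h_output before the copy has completed.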

Both approaches use TensorRT as the inference backend.
You can choose either one based on your preference.

If you need to combine inference with multimedia components, such as camera input, an encoder, or a decoder,
you can also give our DeepStream SDK a try: