Pass a matrix from the GPU to the CPU quickly


I am running a loop in which I take a frame from the camera, pass it to the GPU, do preprocessing, run inference with a network to generate a segmentation mask, pass that mask back to the CPU, and send the mask via UDP.

I have managed to convert the network to TensorRT, so the inference runs very quickly. Image preprocessing takes about 1 ms, inference takes about 1 ms, and copying the mask from GPU memory to CPU memory takes 40 ms.

Now my problem is that passing the mask from the GPU to the CPU is very slow. Since on the Jetson (in my case a Jetson AGX Orin) the CPU and GPU use the same RAM, is it possible to set up shared memory so that I don't have to copy the mask from one memory to the other?

This is my loop:

while True:
    ret, frame = cap.read()  # cap: the camera capture object (e.g. cv2.VideoCapture)
    if not ret:
        break

    # Preprocess image
    img = model_pidnet.preprocess_image(frame, height, width)

    # Inference
    mask, img = model_pidnet(img)

    # Postprocess
    mask = F.interpolate(mask, size=img.size()[-2:], mode='bilinear', align_corners=True)
    mask = torch.argmax(mask, dim=1).squeeze(0).type(torch.uint8)
    mask = mask.cpu()
    mask = mask.numpy()
The problem is in the line mask = mask.cpu(); that is the one that takes about 40 ms.

The post-processing actually lives in a method of the class; I have inlined it here to make the problem easier to follow.

Thanks in advance

Best regards


First, please check if you have maximized the device performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Could you share how you check the duration of inference?
Usually, inference is a non-blocking API call, which means the CPU moves on to the next call immediately while the GPU is still working.
If you measure durations on the CPU, it's possible that the inference time gets attributed to the memcpy, since the memcpy is what blocks the CPU.
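To illustrate the effect, a stage can be timed correctly by synchronizing the GPU before and after it, so that previously queued kernels are not billed to a later blocking call. A minimal sketch, assuming PyTorch as in the original loop (the `timed` helper and the `matmul` workload are illustrative, not from the original code):

```python
import time
import torch

def timed(fn, *args):
    """Wall-clock time of fn in ms, with GPU synchronization so that
    pending asynchronous kernels are not attributed to a later call."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # drain work queued earlier
    t0 = time.perf_counter()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # wait for fn's own kernels
    return out, (time.perf_counter() - t0) * 1000.0

# Example on whatever device is present:
x = torch.ones(1024, 1024)
out, ms = timed(torch.matmul, x, x)
```

Without the second `synchronize()`, `ms` would only measure the kernel launch, and the real GPU time would show up in whichever later call (such as `.cpu()`) forces the CPU to wait.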

For zero-copy memory, you can try page-locked (pinned) memory with PyCUDA.
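The reply suggests PyCUDA's page-locked memory; since the loop already uses PyTorch, the analogous idea there is a pre-allocated pinned host buffer reused across frames, filled with a non-blocking device-to-host copy. A sketch under that assumption (function names are illustrative; it falls back to a plain CPU tensor when CUDA is unavailable):

```python
import torch

def make_pinned_buffer(shape, dtype=torch.uint8):
    # Pinned (page-locked) host memory enables faster DMA transfers
    # and avoids re-allocating a destination tensor every frame.
    pin = torch.cuda.is_available()
    return torch.empty(shape, dtype=dtype, pin_memory=pin)

def copy_to_host(mask_gpu, host_buf):
    # Non-blocking copy into the reusable pinned buffer; synchronize
    # before the host reads the data, otherwise it may still be in flight.
    host_buf.copy_(mask_gpu, non_blocking=True)
    if mask_gpu.is_cuda:
        torch.cuda.synchronize()
    return host_buf.numpy()
```

In the loop, `make_pinned_buffer` would be called once before `while True:`, and `mask = mask.cpu()` replaced by `mask = copy_to_host(mask, host_buf)`.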


Hello, thank you for your reply

I am sorry for taking so long to answer; I did not see the email notification telling me that you had replied.

The sudo nvpmodel -m 0 command I had already executed; what I had not done was sudo jetson_clocks. I will try it.

On the other hand, to measure the time, what I do is import time at the start and then:

t0 = time.time()
# ... operation being timed ...
t = time.time()
print(f"Time of operation {(t - t0) * 1000} ms")

But what you say makes sense: since the calls are non-blocking, part of what I measure as post-processing time may actually be inference. So what do you recommend for measuring the time?


You can try to profile the GPU time with CUDA events.
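A minimal sketch of CUDA-event timing with PyTorch (the helper name and the `matmul` workload are illustrative; events record timestamps on the GPU stream itself, so the result is unaffected by asynchronous launches):

```python
import torch

def gpu_time_ms(fn, *args, warmup=3, iters=10):
    """Average GPU execution time of fn in ms, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):      # warm up caches and CUDA context
        fn(*args)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()     # ensure both events have completed
    return start.elapsed_time(end) / iters

# Example (requires a CUDA device, e.g. the Jetson's GPU):
# x = torch.ones(1024, 1024, device="cuda")
# print(gpu_time_ms(torch.matmul, x, x))
```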


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.