Pass a matrix from the GPU to the CPU quickly


I am running a loop in which I take a frame from the camera, pass it to the GPU, do preprocessing, run inference with a network to generate a segmentation mask, pass that mask back to the CPU, and send the mask via UDP.

I have managed to convert the network to TensorRT, so the inference runs very quickly. Image preprocessing takes about 1 ms, inference takes about 1 ms, and copying the mask from GPU memory to CPU memory takes 40 ms.

Now my problem is that passing the mask from the GPU to the CPU is very slow. Since on the Jetson (in my case a Jetson AGX Orin) the CPU and GPU use the same RAM, is it possible to set up shared memory so that I don't have to copy the mask from one memory to the other?

This is my loop:

while True:
    ret, frame = cap.read()  # cap: the camera capture object (e.g. cv2.VideoCapture)
    if not ret:
        break

    # Preprocess image
    img = model_pidnet.preprocess_image(frame, height, width)

    # Inference
    mask, img = model_pidnet(img)

    # Postprocess
    mask = F.interpolate(mask, size=img.size()[-2:], mode='bilinear', align_corners=True)
    mask = torch.argmax(mask, dim=1).squeeze(0).type(torch.uint8)
    mask = mask.cpu()
    mask = mask.numpy()
The problem is in the line mask = mask.cpu(); that is the one that takes about 40 ms.

The post-processing actually lives in a method of the class; I have inlined it here to make the problem easier to follow.

Thanks in advance

Best regards


First, please check if you have maximized the device performance with the following commands:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Could you share how you check the duration of inference?
Usually, inference is a non-blocking API call, which means the CPU moves on to the next call immediately while the GPU is still working.
If you measure durations on the CPU, it's possible that the inference time gets attributed to the memcpy, since the memcpy is what blocks the CPU.
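To illustrate the effect, a stage can be timed correctly by synchronizing the GPU before and after it, so that previously queued kernels are not billed to a later blocking call. A minimal sketch, assuming PyTorch as in the original loop (the `timed` helper and the `matmul` workload are illustrative, not from the original code):

```python
import time
import torch

def timed(fn, *args):
    """Wall-clock time of fn in ms, with GPU synchronization so that
    pending asynchronous kernels are not attributed to a later call."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # drain work queued earlier
    t0 = time.perf_counter()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # wait for fn's own kernels
    return out, (time.perf_counter() - t0) * 1000.0

# Example on whatever device is present:
x = torch.ones(1024, 1024)
out, ms = timed(torch.matmul, x, x)
```

Without the second `synchronize()`, `ms` would only measure the kernel launch, and the real GPU time would show up in whichever later call (such as `.cpu()`) forces the CPU to wait.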

For zero-copy memory, you can try page-locked (pinned) memory with PyCUDA.
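The reply suggests PyCUDA's page-locked memory; since the loop already uses PyTorch, the analogous idea there is a pre-allocated pinned host buffer reused across frames, filled with a non-blocking device-to-host copy. A sketch under that assumption (function names are illustrative; it falls back to a plain CPU tensor when CUDA is unavailable):

```python
import torch

def make_pinned_buffer(shape, dtype=torch.uint8):
    # Pinned (page-locked) host memory enables faster DMA transfers
    # and avoids re-allocating a destination tensor every frame.
    pin = torch.cuda.is_available()
    return torch.empty(shape, dtype=dtype, pin_memory=pin)

def copy_to_host(mask_gpu, host_buf):
    # Non-blocking copy into the reusable pinned buffer; synchronize
    # before the host reads the data, otherwise it may still be in flight.
    host_buf.copy_(mask_gpu, non_blocking=True)
    if mask_gpu.is_cuda:
        torch.cuda.synchronize()
    return host_buf.numpy()
```

In the loop, `make_pinned_buffer` would be called once before `while True:`, and `mask = mask.cpu()` replaced by `mask = copy_to_host(mask, host_buf)`.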


Hello, thank you for your reply

I am sorry for taking so long to answer; I did not see the email notification telling me that you had replied.

The sudo nvpmodel -m 0 command I had already executed; what I had not done was sudo jetson_clocks. I will try it.

On the other hand, to measure the time, what I do is import time at the start and then:

t0 = time.time()
# ... operation being timed ...
t = time.time()
print(f"Time of operation {(t - t0) * 1000} ms")

But what you say makes sense: since the calls are non-blocking, part of what I measure as post-processing time may actually be inference. So what do you recommend for measuring the time?


You can try to profile the GPU time with CUDA events.
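A minimal sketch of CUDA-event timing with PyTorch (the helper name and the `matmul` workload are illustrative; events record timestamps on the GPU stream itself, so the result is unaffected by asynchronous launches):

```python
import torch

def gpu_time_ms(fn, *args, warmup=3, iters=10):
    """Average GPU execution time of fn in ms, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):      # warm up caches and CUDA context
        fn(*args)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()     # ensure both events have completed
    return start.elapsed_time(end) / iters

# Example (requires a CUDA device, e.g. the Jetson's GPU):
# x = torch.ones(1024, 1024, device="cuda")
# print(gpu_time_ms(torch.matmul, x, x))
```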


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.