Image processing speed issue with CUDA

Hey there people, I'm new to CUDA and I have a speed issue with the code below. When I run this program, my video feed has a noticeable delay. I assumed that "with vpi.Backend.CUDA:" would execute the processing tasks faster than "with vpi.Backend.CPU", since it utilises the GPU on my Jetson Orin Nano, but there is no noticeable difference in the delay on my video feed between vpi.Backend.CUDA and vpi.Backend.CPU, which is confusing me. Any help would be great, as I'm still learning many things.
Thanks a bunch

import cv2
import vpi

def process_frame_vpi(frame, kernel_size=1):
    # Create a VPI stream
    stream = vpi.Stream()
    with vpi.Backend.CUDA:
        with stream:
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            r, g, b = cv2.split(frame_rgb)
            # Create VPI images for each channel
            r_vpi = vpi.asimage(r, vpi.Format.U8)
            g_vpi = vpi.asimage(g, vpi.Format.U8)
            b_vpi = vpi.asimage(b, vpi.Format.U8)
        # Apply median blur using VPI on each channel
        with stream:
            r_blurred = r_vpi.median_filter((kernel_size, kernel_size), stream=stream)
            g_blurred = g_vpi.median_filter((kernel_size, kernel_size), stream=stream)
            b_blurred = b_vpi.median_filter((kernel_size, kernel_size), stream=stream)
        # Sync the stream to ensure processing is complete
        stream.sync()
        # Retrieve the processed images
        r_processed = r_blurred.cpu()
        g_processed = g_blurred.cpu()
        b_processed = b_blurred.cpu()
        # Merge the processed channels back into an RGB image
        with stream:
            processed_frame = cv2.merge([r_processed, g_processed, b_processed])
        # Convert the processed frame back to BGR for display
        with stream:
            processed_frame_bgr = cv2.cvtColor(processed_frame, cv2.COLOR_RGB2BGR)

    return processed_frame_bgr

def main():
    # Open the camera
    gst_pipeline = (
        'nvarguscamerasrc ! '
        'video/x-raw(memory:NVMM), width=1920, height=1080, framerate=30/1 ! '
        'nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! '
        'video/x-raw, format=BGR ! appsink'
    )
    cap = cv2.VideoCapture(gst_pipeline, cv2.CAP_GSTREAMER)

    # Adjust the camera index if necessary

    if not cap.isOpened():
        print("Error: Could not open video stream.")
        return

    while True:
        # Capture a frame from the video feed
        ret, frame = cap.read()
        if not ret:
            print("Error: Could not read frame.")
            break

        # Process the frame using VPI
        processed_frame = process_frame_vpi(frame, 5)

        # Display the original and processed frames
        with vpi.Backend.CUDA:
            cv2.imshow('Processed Frame', processed_frame)
            cv2.imshow('Original Frame', frame)

        # Break the loop on 'q' key press
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    # Release the camera and close the display windows
    cap.release()
    cv2.destroyAllWindows()

if __name__ == "__main__":
    main()

Questions about NVIDIA’s embedded platforms typically receive faster / better / more numerous answers in the sub-forums dedicated to them. In this case:

As a sanity check, you would want to make sure that when you select the GPU backend the GPU is actually being used. There could be a problem that precludes the use of the GPU, with execution then falling back to the CPU backend.
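One way to do this sanity check from a script is to watch the GPU load that tegrastats prints while the program runs. The snippet below is only a minimal sketch: the helper name gpu_load_percent is made up for illustration, and it assumes the GPU load shows up in a field called GR3D_FREQ followed by a percentage, which may differ between JetPack releases. The sample line is a shortened, made-up one, not real output.

```python
import re

# Assumption: tegrastats reports GPU load as "GR3D_FREQ <n>%".
_GR3D = re.compile(r"GR3D_FREQ\s+(\d+)%")

def gpu_load_percent(tegrastats_line):
    """Return the GPU load in percent from one tegrastats line, or None if absent."""
    match = _GR3D.search(tegrastats_line)
    return int(match.group(1)) if match else None

# Made-up, shortened line for illustration only; real output has more fields.
sample = "RAM 2048/7620MB CPU [25%@1510,30%@1510] GR3D_FREQ 42%"
print(gpu_load_percent(sample))
```

If the reported load stays near 0% while the CUDA backend is selected, that would suggest the GPU is not doing the work.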

As a next step, I would suggest profiling the code. The Jetson Orin Nano is a low-end device with low-bandwidth memory (the specification says 68 GB/sec). As this is an integrated platform, both CPU and GPU use (and share) the same physical memory. I would expect relatively simple image processing tasks to be limited by memory throughput for both CPU and GPU versions, leading to roughly identical performance.

Profiling of the code would allow you to confirm or refute this hypothesis.
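Short of a full profiler, a rough first pass is to time each stage of the per-frame loop separately (capture, processing, display) and see where the delay comes from. The sketch below uses only the standard library; StageTimer is a hypothetical helper, and the stand-in stages are placeholders for the real calls. Note that GPU work is submitted asynchronously, so the timed function has to include the synchronization (or the copy back to the CPU), otherwise only the submission cost is measured.

```python
import time
from collections import defaultdict

class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)  # stage name -> total seconds
        self.counts = defaultdict(int)    # stage name -> number of calls

    def time(self, name, func, *args, **kwargs):
        """Call func(*args, **kwargs), charge its duration to `name`, return its result."""
        start = time.perf_counter()
        result = func(*args, **kwargs)
        self.totals[name] += time.perf_counter() - start
        self.counts[name] += 1
        return result

    def report(self):
        """Return {stage name: mean milliseconds per call}."""
        return {n: 1000.0 * self.totals[n] / self.counts[n] for n in self.totals}

# Stand-in stages; in the real loop these would be the capture,
# processing and display calls from the program.
timer = StageTimer()
for _ in range(3):
    timer.time("capture", time.sleep, 0.01)
    timer.time("process", time.sleep, 0.005)

for stage, mean_ms in timer.report().items():
    print(f"{stage}: {mean_ms:.1f} ms per frame")
```

If the processing stage turns out to be a small part of the per-frame time, switching backends cannot change the visible delay much, whatever the backend does internally.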

The situation would be different on a host system with a discrete GPU, where a high-end system may sport a CPU that can access system memory at 200 GB/sec whereas the GPU can access its attached memory at 1000 GB/sec.
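To put numbers on that comparison: for a task that is purely limited by memory throughput, the best-case GPU speedup is roughly the ratio of the two bandwidths. The snippet below is just that arithmetic with the figures quoted above; it assumes both processors actually reach their peak bandwidth, which real code rarely does, and the function name is made up for illustration.

```python
def bandwidth_bound_speedup(cpu_gb_per_s, gpu_gb_per_s):
    """Idealized GPU-over-CPU speedup for a purely memory-bound task."""
    if cpu_gb_per_s <= 0 or gpu_gb_per_s <= 0:
        raise ValueError("bandwidths must be positive")
    return gpu_gb_per_s / cpu_gb_per_s

# Integrated platform: CPU and GPU share the same 68 GB/sec memory.
integrated = bandwidth_bound_speedup(68, 68)
# Discrete GPU example: 200 GB/sec system memory vs 1000 GB/sec GPU memory.
discrete = bandwidth_bound_speedup(200, 1000)

print(f"integrated: {integrated:.1f}x, discrete: {discrete:.1f}x")
```

Under these assumptions the integrated case comes out at 1x and the discrete case at 5x, which is why roughly identical CPU and GPU timings on the shared-memory device would not be surprising.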


Hey, thanks for your reply. I have taken two screenshots in jtop: the first one uses vpi.Backend.CPU for the program and the second one uses vpi.Backend.CUDA. In both, the GPU memory use seems to be equal (does this mean the GPU is used for both?). I can also see that when vpi.Backend.CUDA is used, the CPU usage drops to around 130% instead of 255% (so I guess this means that the GPU is used). Am I right with these assumptions? Thanks a bunch
Kind regards

This one uses vpi.Backend.CUDA