Image processing speed issue with CUDA

joe221 · June 14, 2024, 6:03pm

Hey there people, I'm new to CUDA and I have speed issues with the code below. When running this program, my video feed has a noticeable delay. I assumed that "with vpi.Backend.CUDA:" would execute processing tasks faster than "with vpi.Backend.CPU:", as it utilises the GPU on my Jetson Orin Nano, but there is no noticeable difference in the delay on my video feed between vpi.Backend.CUDA and vpi.Backend.CPU, which is confusing me. Any help would be great as I'm still learning many things.
Thanks a bunch
regards
joe

import cv2
import vpi

def process_frame_vpi(frame, kernel_size=1):
    # Create a VPI stream
    stream = vpi.Stream()
    with vpi.Backend.CUDA:
        with stream:
            frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            r, g, b = cv2.split(frame_rgb)
            # Create VPI images for each channel
            r_vpi = vpi.asimage(r, vpi.Format.U8)
            g_vpi = vpi.asimage(g, vpi.Format.U8)
            b_vpi = vpi.asimage(b, vpi.Format.U8)
        # Apply median blur using VPI on each channel
        with stream:
            r_blurred = r_vpi.median_filter((kernel_size, kernel_size), stream=stream)
            g_blurred = g_vpi.median_filter((kernel_size, kernel_size), stream=stream)
            b_blurred = b_vpi.median_filter((kernel_size, kernel_size), stream=stream)
        # Sync the stream to ensure processing is complete
        stream.sync()
        # Retrieve the processed images
        r_processed = r_blurred.cpu()
        g_processed = g_blurred.cpu()
        b_processed = b_blurred.cpu()
        # Merge the processed channels back into an RGB image
        with stream:
            processed_frame = cv2.merge([r_processed, g_processed, b_processed])
        # Convert the processed frame back to BGR for display
        with stream:
            processed_frame_bgr = cv2.cvtColor(processed_frame, cv2.COLOR_RGB2BGR)
        stream.sync()

    return processed_frame_bgr

def main():
    # Open the camera
    gst_pipeline = (
        'nvarguscamerasrc ! '
        'video/x-raw(memory:NVMM), width=1920, height=1080, framerate=30/1 ! '
        'nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! '
        'video/x-raw, format=BGR ! appsink'
    )
    cap = cv2.VideoCapture(gst_pipeline, cv2.CAP_GSTREAMER)

    # Adjust the camera index if necessary

    if not cap.isOpened():
        print("Error: Could not open video stream.")
        return
    while True:
        # Capture a frame from the video feed
        ret, frame = cap.read()
        if not ret:
            print("Error: Could not read frame.")
            break

        # Process the frame using VPI
        processed_frame = process_frame_vpi(frame, 5)

        # Display the original and processed frames
        with vpi.Backend.CUDA:
            cv2.imshow('Processed Frame', processed_frame)
            cv2.imshow('Original Frame', frame)

        # Break the loop on 'q' key press
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

if __name__ == "__main__":
    main()
#planarcode

njuffa · June 14, 2024, 7:00pm

Questions about NVIDIA’s embedded platforms typically receive faster / better / more numerous answers in the sub-forums dedicated to them. In this case:

As a sanity check, you would want to make sure that when you select the GPU backend the GPU is actually being used. There could be a problem that precludes the use of the GPU, with execution then falling back to the CPU backend.

As a next step, I would suggest profiling the code. The Jetson Orin Nano is a low-end device with low-bandwidth memory (the specification says 68 GB/sec). As this is an integrated platform, both CPU and GPU use (and share) the same physical memory. I would expect relatively simple image processing tasks to be limited by memory throughput for both CPU and GPU versions, leading to roughly identical performance.

Profiling of the code would allow you to confirm or refute this hypothesis.
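
One lightweight way to start profiling, before reaching for a dedicated profiler, is to time each stage of the frame loop separately. The sketch below is only an illustration: `StageTimer` is a hypothetical helper, not part of VPI or OpenCV, and the `time.sleep` calls stand in for the real `cap.read()`, `process_frame_vpi(frame, 5)` and `cv2.imshow` calls from the original post.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Hypothetical helper: accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)  # stage name -> total seconds
        self.counts = defaultdict(int)    # stage name -> number of samples

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return {stage name: mean milliseconds per call}."""
        return {n: 1000.0 * self.totals[n] / self.counts[n] for n in self.totals}


if __name__ == "__main__":
    timer = StageTimer()
    for _ in range(5):
        with timer.stage("capture"):
            time.sleep(0.002)  # stand-in for cap.read()
        with timer.stage("process"):
            time.sleep(0.005)  # stand-in for process_frame_vpi(frame, 5)
        with timer.stage("display"):
            time.sleep(0.001)  # stand-in for cv2.imshow / cv2.waitKey
    for name, ms in timer.report().items():
        print(f"{name:8s} {ms:6.2f} ms/frame")
```

Wall-clock timing only reflects GPU work if the timed region waits for it to finish; the function in the original post calls `stream.sync()` before returning, so wrapping the whole call should include that work. Running the loop once with each backend and comparing the per-stage numbers would show whether the filter itself, or capture, copies and display, dominates the delay.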

The situation would be different on a host system with a discrete GPU, where a high-end system may sport a CPU that can access system memory at 200 GB/sec whereas the GPU can access its attached memory at 1000 GB/sec.
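
To put the bandwidth figures quoted above into perspective, here is a rough back-of-the-envelope sketch. It assumes a 1920x1080 frame with three 8-bit channels (as in the posted pipeline) and a single pass that reads and writes every sample exactly once at the quoted peak bandwidth. `pass_floor_ms` is a hypothetical helper, and real code will not reach peak bandwidth, so these are lower bounds rather than predictions.

```python
def pass_floor_ms(width, height, channels, bandwidth_gb_s, bytes_per_sample=1):
    """Lower bound in ms for one pass that reads and writes each sample once."""
    frame_bytes = width * height * channels * bytes_per_sample
    traffic_bytes = 2 * frame_bytes  # one read plus one write of the frame
    return 1000.0 * traffic_bytes / (bandwidth_gb_s * 1e9)


if __name__ == "__main__":
    # Peak bandwidth figures as quoted in the reply above (GB/sec).
    platforms = [
        ("integrated, shared by CPU and GPU", 68),
        ("high-end host CPU", 200),
        ("discrete GPU", 1000),
    ]
    for label, bandwidth in platforms:
        floor = pass_floor_ms(1920, 1080, 3, bandwidth)
        print(f"{label:36s} {floor:.3f} ms per pass")
```

On the integrated platform the CPU and GPU share the same floor, whereas in the discrete example the GPU's floor is five times lower than the CPU's. Comparing measured stage times against this floor is one way to judge the hypothesis: times within a small multiple of it are consistent with a memory-bound stage, while times far above it suggest that something else dominates.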

joe221 · June 14, 2024, 7:49pm

Hey, thanks for your reply. I have taken two screenshots in jtop: the first one uses vpi.Backend.CPU for the program and the second one uses vpi.Backend.CUDA. In both there seems to be equal GPU memory usage (does this mean the GPU is used for both?), and I can also see that when vpi.Backend.CUDA is used the CPU usage drops to around 130% instead of 255% (so I guess this means that the GPU is used). Am I right with these assumptions? Thanks a bunch
Kind regards
Joe

This one uses vpi.Backend.CUDA
