GPU requirements

Hi,
I have a program which takes an image as input and has three parameters that change the output. My current setup is a Jetson Orin Nano, and the program takes at least 20 to 30 seconds to run, depending on the size of the input image and its content. I want to make an interactive demo (a web application) of the program, with sliders to change the three parameters and show the output. What kind of GPU hardware would be needed to run the program in real time?

This is a sample run of the program.
python3 pyrebel_main_vision.py --input images/lotus.jpg --abs_threshold 10 --edge_threshold 5 --bound_threshold 100

The output is written to ‘output.png’.

The program setup.
git clone https://github.com/ps-nithin/pyrebel
cd pyrebel
pip install . or pip install pyrebel
cd demo
python3 pyrebel_main_vision.py --input <image filename> --abs_threshold <value1> --edge_threshold <value2> --bound_threshold <value3>

Thanks,

Can you define “real time”? What frame rate are you envisioning? My understanding is that a delay < 50 milliseconds (0.05 seconds) is generally considered lag free, which means you would have to achieve a speedup of 600x, unobtainable with any currently shipping GPU.

[Later:] A comparison of specifications suggests that the latest RTX 5090 would provide (very roughly) 25x the performance of the Jetson Orin Nano. This is based on a comparison of their respective memory bandwidth numbers, as image processing tends to be bottlenecked on that.

[Even later:] TechPowerUp on the other hand estimates a speedup of 47x between Jetson Orin Nano and RTX 5090.

Hi, I meant something like 1 second or under.

In that case, if TechPowerUp’s estimates are in the ballpark, an RTX 4090, RTX 5080, or RTX 5090 would provide the required speedup of > 30x.
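For reference, the required speedup is simply the current runtime divided by the target latency; using the runtimes quoted above:

# Required speedup = current runtime / target latency (numbers from this thread).
current_runtime_s = 30.0                  # worst-case Orin Nano runtime reported above
for target_s in (0.05, 1.0):              # "lag free" vs. "1 second or under"
    print(f"target {target_s} s -> ~{current_runtime_s / target_s:.0f}x speedup needed")
# target 0.05 s -> ~600x speedup needed
# target 1.0 s -> ~30x speedup needed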

I am very skeptical that a comparison based on specifications is a good indication of real-world performance. The program takes around 26 s on the first run and around 18 s on subsequent runs on an Orin Nano. The same program takes 104 s on the first run and 79 s on subsequent runs on a Jetson Nano. That is only around 4x real-world performance, yet the Orin Nano is considered to be up to 80x better than the Jetson Nano. I guess there are many other factors that affect real-world performance.

Would be great if someone could run the program on an RTX 4090 and share the results.

You are correct to be skeptical. At the same time, nobody can provide accurate scaling estimates for code they haven’t seen, haven’t run, haven’t analyzed, and haven’t profiled.

I provided what I consider a lower bound of 25x for the RTX 5090 based on a comparison of the respective memory bandwidth numbers, on the assumption that the task is bound by memory bandwidth. This assumption is based on my experience with image processing. While it is correct that one cannot expect more than 85% to 90% of the theoretical memory bandwidth to be realized in practice, that applies to both platforms equally, so the scale factor remains unaffected.

From my understanding, the TechPowerUp numbers for relative performance are derived from collecting results of runs of a proprietary benchmark including both compute bound and memory bandwidth bound portions. One would therefore expect these to be higher than for a purely memory bandwidth bound workload, and in fact they are, with the RTX 5090 listed at 47x of the Jetson Orin Nano, and the RTX 4090 and RTX 5080 at a bit over 30x.

If the workload were completely compute bound (that is quite unlikely these days, as memory bandwidth simply has not kept up with the growth in compute engine throughput), one would expect something like a 75x speedup for the RTX 5090 from the ratio of the FP32 throughput.

If you cannot get your hands on these highest end consumer cards, try a number of discrete GPUs and compare the scaling you observe with the corresponding TechPowerUp estimates. With several measurements, preferably covering a largish part of the performance spectrum, you might then be in a position to roughly extrapolate performance to the RTX 5090.
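As a sketch of the kind of extrapolation I have in mind (all numbers below are made-up placeholders; you would substitute your own measured runtimes and the corresponding TechPowerUp relative-performance figures):

import numpy as np

# Placeholder data: relative performance of each tested GPU vs. the Orin Nano
# (e.g. from TechPowerUp) and the runtime you measured on that GPU.
rel_perf  = np.array([1.0, 6.0, 12.0, 20.0])   # hypothetical values
runtime_s = np.array([18.0, 3.4, 1.9, 1.1])    # hypothetical measurements

# If scaling were perfect, runtime would be proportional to 1/rel_perf.
# Fit runtime against 1/rel_perf to capture the scaling actually observed.
coeff = np.polyfit(1.0 / rel_perf, runtime_s, 1)

rtx5090_rel_perf = 47.0                        # TechPowerUp estimate quoted above
print("Extrapolated RTX 5090 runtime: ~%.2f s" % np.polyval(coeff, 1.0 / rtx5090_rel_perf))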

I am sorry if you haven't noticed, but I had shared the code in my first post.

I am staring very intently at your initial post in this thread right now. I do not see any clickable link to the code, nor any code attached, nor code inlined into the post. I also do not own an RTX 4090.

FWIW, based on memory bandwidth alone, the scale factor between Jetson Orin Nano and RTX 4090 would be 14.5x (1.01 TB/sec vs 68.3 GB/sec).
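The arithmetic is just the ratio of the published peak bandwidth figures; a quick sketch (exact factors shift slightly depending on which published numbers you plug in, and the RTX 5090 figure of roughly 1.79 TB/sec is my addition here):

# Approximate published peak memory bandwidths in GB/s.
bandwidth_gbps = {
    "Jetson Orin Nano": 68.3,
    "RTX 4090": 1010.0,     # ~1.01 TB/s
    "RTX 5090": 1792.0,     # ~1.79 TB/s
}
orin = bandwidth_gbps["Jetson Orin Nano"]
for gpu, bw in bandwidth_gbps.items():
    print(f"{gpu}: ~{bw / orin:.1f}x the Orin Nano's bandwidth")
# RTX 4090: ~14.8x, RTX 5090: ~26.2x -- close to the rough 14.5x and 25x factors
# quoted above, assuming the workload is purely memory-bandwidth bound.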

# Copyright (C) 2024-2025 Nithin PS.
# This file is part of Pyrebel.
#
# Pyrebel is free software: you can redistribute it and/or modify it under the terms of 
# the GNU General Public License as published by the Free Software Foundation, either 
# version 3 of the License, or (at your option) any later version.
#
# Pyrebel is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; 
# without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR 
# PURPOSE. See the GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along with Pyrebel.
# If not, see <https://www.gnu.org/licenses/>.
#

import numpy as np
from PIL import Image
import math, argparse, time, sys
from numba import cuda    # cuda.to_device() and cuda.synchronize() are used below
from pyrebel.preprocess import Preprocess
from pyrebel.abstract import Abstract
from pyrebel.learn import Learn
from pyrebel.edge import Edge
from pyrebel.utils import *
from pyrebel.getnonzeros import *     
        
# This is a demo of forming 2D sketch using abstraction of data.
# When you run this program the output is written to 'output.png'.
#

parser=argparse.ArgumentParser()
parser.add_argument("-i","--input",help="Input file name.")
parser.add_argument("-at","--abs_threshold",help="Threshold of abstraction.")
parser.add_argument("-et","--edge_threshold",help="Threshold of edge detection.")
parser.add_argument("-s","--bound_threshold",help="Threshold of boundary size.")
args=parser.parse_args()

if args.edge_threshold:
    edge_threshold=int(args.edge_threshold)
else:
    edge_threshold=5   
if args.abs_threshold:
    abs_threshold=int(args.abs_threshold)
else:
    abs_threshold=10    
if args.bound_threshold:
    bound_threshold=int(args.bound_threshold)
else:
    bound_threshold=100
 
# Re-run the full pipeline in an endless loop so that first-run vs. subsequent-run
# timings can be observed.
while True:
    start_time=time.time()    
    if args.input:
        img_array=np.array(Image.open(args.input).convert('L'))
    shape_d=cuda.to_device(img_array.shape)
    
    threadsperblock=(16,16)
    blockspergrid_x=math.ceil(img_array.shape[0]/threadsperblock[0])
    blockspergrid_y=math.ceil(img_array.shape[1]/threadsperblock[1])
    blockspergrid=(blockspergrid_x,blockspergrid_y)
    img_array_d=cuda.to_device(img_array)
    edge=Edge(img_array)
    edge.find_edges(edge_threshold)
    edges=edge.get_edges_bw()
    
    # Initialize the preprocessing class.
    pre=Preprocess(edges)
    # Set the minimum and maximum size of boundaries of blobs in the image. Defaults to a minimum of 64.
    pre.set_bound_size(bound_threshold)
    # Perform the preprocessing to get 1D array containing boundaries of blobs in the image.
    pre.preprocess_image()
    # Get the 1D array.
    bound_data=pre.get_bound_data()
    bound_data_d=cuda.to_device(bound_data)
    bound_mark=pre.get_bound_mark()
    bound_mark_d=cuda.to_device(bound_mark)
    # Initialize the abstract boundary.
    init_bound_abstract=pre.get_init_abstract()
    
    # Get 1D array containing size of boundaries of blobs in the array.
    bound_size=pre.get_bound_size()

    print("len(bound_data)=",len(bound_data))
    print("n_blobs=",len(bound_size))
    
    scaled_image=pre.get_image_scaled()
    scaled_image_d=cuda.to_device(scaled_image)
    scaled_shape=scaled_image.shape
    scaled_shape_d=cuda.to_device(scaled_shape)
    
    # Initialize the abstraction class
    abstraction=Abstract(bound_data,len(bound_size),init_bound_abstract,scaled_shape,True)
    abstraction.do_abstract_all(abs_threshold)
    abs_points=abstraction.get_abstract()
    abs_size=abstraction.get_abstract_size()
    abs_size_d=cuda.to_device(abs_size)

    abs_draw=decrement_by_one_cuda(abs_points)
    abs_draw_d=cuda.to_device(abs_draw)
    
    out_image=np.full(img_array.shape,255,dtype=np.int32)
    out_image_d=cuda.to_device(out_image)
    
    bound_data_orig=np.zeros(len(bound_data),dtype=np.int32)
    bound_data_orig_d=cuda.to_device(bound_data_orig)
    
    scale_down_pixels[len(bound_data),1](bound_data_d,bound_data_orig_d,scaled_shape_d,shape_d,3)
    cuda.synchronize()
    
    draw_lines[len(abs_draw),1](abs_draw_d,abs_size_d,bound_data_orig_d,bound_mark_d,out_image_d,0,3)
    cuda.synchronize()
    
    out_image_h=out_image_d.copy_to_host()
    
    # Save the output to disk.
    Image.fromarray(out_image_h).convert('RGB').save("output.png")
    print("Finished in",time.time()-start_time,"seconds at",float(1/(time.time()-start_time)),"fps.")

Hi, this is the code I was mentioning.

IMHO, one exception I would count is if the kernel can make use of the larger L2 cache of Ada-or-newer GPUs (okay, some data center GPUs like the A100 had a large L2 earlier on).

But for it to make a difference, the kernel needs at least some complexity, so that it loads lots of far-away data per processed item, or it must be able to keep all the data passed between several kernels in the enlarged L2.

I have no evidence, but is memory bandwidth throttling the program?

As njuffa pointed out, compute speed has improved a lot, so it is likely the program is memory bound on newer GPUs. You can run Nsight Compute to get separate figures for the compute and memory requirements.

Even for algorithms that are compute bound in theory, it often takes major hand-optimization effort to load data fast enough for the compute units to max out. So memory-bound kernels are the typical case now.

(Not counting code using slow operations like double types on consumer GPUs, sqrt, divisions, …)

What optimizations can I make, software or hardware? I am no expert in either.

Some general advice:

Use profiling to find out what the bottleneck is and concentrate on optimizing that.

For memory transfers between host and device, pinned memory can help.

Overlapping memory transfers and computations saves time. If you have a single image, consider dividing it into blocks.
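To illustrate the pinned-memory and overlap points, here is a minimal Numba CUDA sketch; my_kernel and the chunking scheme are placeholders for illustration, not code from pyrebel:

import numpy as np
from numba import cuda

@cuda.jit
def my_kernel(data):                  # placeholder kernel: increment every element
    i = cuda.grid(1)
    if i < data.size:
        data[i] += 1

img = np.random.randint(0, 256, size=(2048, 2048), dtype=np.uint8).ravel()

# Pinned (page-locked) host memory speeds up host<->device copies and is
# required for those copies to overlap with kernel execution.
pinned = cuda.pinned_array(img.shape, dtype=img.dtype)
pinned[:] = img

n_chunks = 4
chunk = img.size // n_chunks
streams = [cuda.stream() for _ in range(n_chunks)]

# Issue copy-in, kernel, copy-out per chunk on its own stream, so the transfer
# of one chunk can overlap with the computation of another.
for s, stream in enumerate(streams):
    lo, hi = s * chunk, (s + 1) * chunk
    d_chunk = cuda.to_device(pinned[lo:hi], stream=stream)
    threads = 256
    blocks = (hi - lo + threads - 1) // threads
    my_kernel[blocks, threads, stream](d_chunk)
    d_chunk.copy_to_host(pinned[lo:hi], stream=stream)

cuda.synchronize()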

Make sure you can do enough work in parallel. Especially on newer GPUs, the number of blocks, and thus the overall number of threads, should be several hundred thousand. Often algorithms have to be changed from a CPU version, where a single thread loads and stores within the same memory block, to a version which stores into multiple memory blocks and later combines those.
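On the launch-configuration point: the posted demo launches some kernels as kernel[len(bound_data),1], i.e. one thread per block. A more typical 1D launch uses a few hundred threads per block and sizes the grid to cover all work items; the helper below is hypothetical and assumes the kernel computes its index with cuda.grid(1) and bounds-checks it:

def launch_1d(kernel, n_items, *args, threads_per_block=256):
    # Round the grid up so that blocks * threads_per_block >= n_items.
    blocks = (n_items + threads_per_block - 1) // threads_per_block
    kernel[blocks, threads_per_block](*args)

# Hypothetical usage with the demo's kernel (argument list unchanged):
# launch_1d(scale_down_pixels, len(bound_data),
#           bound_data_d, bound_data_orig_d, scaled_shape_d, shape_d, 3)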

The code is located at pyrebel/src/pyrebel at main · ps-nithin/pyrebel · GitHub
The demo script pyrebel_main_vision.py, which I mentioned in the initial post, is located at pyrebel/demo at main · ps-nithin/pyrebel · GitHub

I am not sure how to profile and optimize the code. It would be great if someone could help me out.

Thanks,

There are many resources available for learning how to profile. In my opinion, you should also learn some basics about analysis-driven optimization (ADO). You can find forum posts that discuss how to profile and how to apply ADO. This online tutorial covers performance analysis in session 8.

A typical methodology to get started would be to use Nsight Systems to get an overall timeline, and then either use the Pareto analysis provided by Nsight Systems, or your own “visual study” of the timeline, to decide which are the longest-duration activities you want to focus on first.

After that step, if you decide to focus on a single kernel (for example), you might switch to using Nsight Compute and see what limiters it reports.

“But all this is for CUDA C++, I am using Python!” Actually, there are very few changes. Nsight Compute and Nsight Systems can profile Numba CUDA Python. There are a few things to be aware of:

  1. You may need to instruct the profilers to profile child processes.
  2. You may need to do a little bit of cognitive work to connect the operations (e.g. kernel names) in numba CUDA with what you see in the profiler report.

Although it refers to pytorch, here’s a note written by a NVIDIA expert that demonstrates how you could get started.
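For example (exact options can differ a bit between tool versions), something along these lines will capture an overall timeline and a per-kernel report for the demo script:

# Timeline of kernels, memory copies and CPU activity; writes report.nsys-rep
nsys profile -o report --trace=cuda,osrt,nvtx python3 pyrebel_main_vision.py --input images/lotus.jpg --abs_threshold 10 --edge_threshold 5 --bound_threshold 100

# Detailed per-kernel analysis; --target-processes all also covers Python child processes
ncu --set full --target-processes all -o kernels python3 pyrebel_main_vision.py --input images/lotus.jpg --abs_threshold 10 --edge_threshold 5 --bound_threshold 100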

Hi,
By modifying the kernel launch configuration I was able to double the speed of the program. There are two kernels for which Nsight Compute says “No quantifiable performance optimization opportunities were discovered” but which still consume much of the time. How can I speed up such kernels? I was also not able to find statistics for the time taken by data transfers between device and host in Nsight Compute.

Thanks,

Nsight Systems is the tool to use for measuring this.
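For example, after capturing a run of the demo, the statistics output includes summaries of CUDA memory operations (host-to-device and device-to-host copy times); exact report names and layout can vary a bit by version:

nsys profile -o report python3 pyrebel_main_vision.py --input images/lotus.jpg --abs_threshold 10 --edge_threshold 5 --bound_threshold 100
nsys stats report.nsys-rep     # prints kernel and memory-transfer time summaries, among others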