How to optimize RAM usage on the Jetson

Hello, I am using Python to develop my YOLOv8 application on a Jetson Orin Nano. I decode two H.264 camera streams and perform inference with a .engine file. There are two things we noticed:

  1. RAM usage is extremely high; we can basically do nothing else once the application is running.
  2. The preprocessing and postprocessing are time-consuming (and I think the postprocessing does not include tracking).

Has anyone tried YOLOv8 + BoT-SORT with Python on any Jetson model? Please offer some advice on optimizing this.

Also, I am trying to develop a C++ application on the Jetson. Since the Jetson shares RAM between the CPU and GPU, maybe I can do something like avoiding cv::cuda::GpuMat?


Could you share more info about your application?
Are you using the frameworks from ultralytics?


Hello, AastaLLL:

Yes, I am using the Ultralytics API. Could you tell me what kind of information you need? I really appreciate your attention, thanks.

Hi, Aasta, this is how our application works, basically:

  1. decode the H.264 RTSP camera streams with GStreamer. (This model doesn't have DeepStream; I failed to install it, so I need options other than DS.)
  2. extract frame data in a pull-sample callback like this:
```python
def new_buffer(sink, data):
    # pull a sample from the appsink
    sample = sink.emit("pull-sample")
    buf = sample.get_buffer()
    caps = sample.get_caps()
    width = caps.get_structure(0).get_value('width')
    height = caps.get_structure(0).get_value('height')
    print("frame size: {}, {}".format(width, height))
    # copy the buffer into an RGBA image array
    arr = np.ndarray(
        shape=(height, width, 4),
        dtype=np.uint8,
        buffer=buf.extract_dup(0, buf.get_size()))
    # keep only the first 3 channels
    data.img = arr[:, :, 0:3]
    data.frame_shape = data.img.shape
    # wrap-around frame counter
    if data.frame < 10000:
        data.frame = data.frame + 1
    else:
        data.frame = 0
    return Gst.FlowReturn.OK
```
  3. make a copy of the frame data from the buffer for inference, in case frames get dropped.
  4. call the Ultralytics YOLOv8 API, e.g. model.track().
  5. get the tracking results and draw the bounding boxes on the frame data with cv2.rectangle().

I think we could call the infer method directly with the GStreamer buffer, since the GPU and CPU share this 8 GB of RAM.


Sorry for the late update.

Ultralytics generally uses TensorRT for inference, which depends on the CUDA library.
Loading CUDA takes memory (>600 MB), since it needs to load all the modules into memory.

In CUDA 11.8 we introduced lazy module loading, which lets users load only the required CUDA modules and can help reduce memory usage.
Please give it a try (JetPack 6 with CUDA 12).
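For reference, lazy loading is controlled by the CUDA_MODULE_LOADING environment variable, and it has to be set before the first CUDA context is created. In Python that means setting it before importing the CUDA-dependent packages, e.g.:

```python
import os

# Must be set before any CUDA-using library (torch, ultralytics,
# tensorrt) is imported; otherwise the modules load eagerly anyway.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

# ... now import the CUDA-dependent packages:
# from ultralytics import YOLO
```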

Moreover, it looks like your preprocessing and postprocessing are CPU-based.
If so, it's expected to take time, since memory transfers (CPU ↔ GPU) are required.
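To make the CPU cost concrete: the usual YOLO preprocessing (normalize to [0, 1], HWC → CHW, add a batch dimension) touches every pixel on the CPU before the result is transferred to the GPU. A minimal NumPy sketch of those steps (the exact Ultralytics internals may differ, e.g. letterbox resizing is omitted here):

```python
import numpy as np

def preprocess(frame):
    """frame: HxWx3 uint8 RGB image -> 1x3xHxW float32 tensor."""
    x = frame.astype(np.float32) / 255.0  # uint8 -> float, normalize
    x = np.transpose(x, (2, 0, 1))        # HWC -> CHW
    return x[np.newaxis, ...]             # add batch dimension

frame = np.zeros((640, 640, 3), dtype=np.uint8)
out = preprocess(frame)
assert out.shape == (1, 3, 640, 640)
```

Running the same ops as GPU kernels, on memory the GPU already owns, removes both the per-pixel CPU work and the transfer.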


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.