Python 3.8 RAM overflow and loading issues

Hi community,

First, I want to mention that this is our first project at a bigger scale, so we don't know everything yet, but we learn fast.

We developed code for image recognition. We first tried it on a Raspberry Pi 4B but quickly found that it was way too slow overall. Currently we are using an NVIDIA Jetson Nano. The first recognition was OK (around 30 s) and the second try was even better (around 6-7 s); the first one took so long because the model is loaded for the first time. The image recognition can be triggered via an API, and the metadata from the AI model is returned as the response. We use FastAPI for this.
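For context, the endpoint is roughly shaped like this (a minimal sketch with placeholder names and routes, not our exact code):

    from fastapi import FastAPI
    from classify import classify   # hypothetical module layout

    app = FastAPI()

    @app.post("/recognize")
    def recognize(image_path: str):
        # trigger the CNN and return its metadata as the JSON response
        return {"result": classify(image_path)}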

But there is a problem right now: if I load my CNN as a global variable at the beginning of my classification file (so it is loaded on import) and then use it from a subprocess, I need to use mp.set_start_method('spawn'), because otherwise I get the following error:

"RuntimeError: Cannot re-initialize CUDA in forked subprocess.
To use CUDA with multiprocessing, you must use the 'spawn' start method"

Now, that is of course an easy fix: just add the call above before starting the subprocess. Indeed this works, but another challenge appears at the same time. After setting the start method to 'spawn', the error disappears, but the Jetson starts to allocate way too much memory.

Because of the overhead and the preloaded CNN model, RAM usage is around 2.5 GB before the subprocess starts. After the start it doesn't stop allocating RAM; it consumes all 4 GB of RAM and the entire 6 GB of swap. Right after this, the whole API process gets killed with the error "cannot allocate memory", which is no surprise at that point.

I managed to fix that as well by loading the CNN model inside the classification function (not preloading it onto the GPU as in the two cases before). However, this has a problem too: loading the model onto the GPU takes around 15-20 s, and this happens every time a recognition starts. That is not acceptable for us, and we are wondering why we cannot preload the model without the whole thing dying after two image recognitions. Our goal is to be under 5 s.

    # classify.py
    import torch
    import multiprocessing as mp
    import torchvision.transforms as transforms
    from skimage import io
    import time
    from torch.utils.data import Dataset
    from .loader import *
    from .ResNet import *

    # if this part is inside classify(), then no allocation problem occurs
    net = ResNet152(num_classes=25)
    net = net.to('cuda')
    save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
    net.load_state_dict(save_file)

    def classify(imgp="", return_dict=None):
        # do some classification with the net and store the result in return_dict
        pass

    if __name__ == '__main__':
        mp.set_start_method('spawn')  # if commented out, the first error occurs
        manager = mp.Manager()
        return_dict = manager.dict()
        p = mp.Process(target=classify, args=('./bild.jpg', return_dict))
        p.start()
        p.join()
        print(return_dict.values())
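
For reference, the workaround described above (loading the model inside the classification function instead of at import time) would look roughly like this. It avoids the runaway allocation but pays the 15-20 s model load on every call:

    def classify(imgp="", return_dict=None):
        # loading the model here instead of at import time avoids the
        # memory blow-up, but takes 15-20 s on every call
        net = ResNet152(num_classes=25)
        net = net.to('cuda')
        save_file = torch.load("./model.pt", map_location=torch.device('cuda'))
        net.load_state_dict(save_file)
        # ... run the actual classification with net and put the result into return_dict ...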

Any help here will be much appreciated. Thank you.

Hi,

It looks like you are facing an issue related to the CUDA context.
In general, you should store/restore the CUDA context when switching tasks between threads.
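For example, with PyCUDA that pattern looks roughly like this (a sketch assuming you manage the context yourself; the linked sample may differ):

    import threading
    import pycuda.driver as cuda

    cuda.init()
    ctx = cuda.Device(0).make_context()   # create the context (current in this thread)
    ctx.pop()                             # detach it from the creating thread

    def worker():
        ctx.push()                        # restore the context in the worker thread
        try:
            pass                          # ... CUDA work for this task ...
        finally:
            ctx.pop()                     # release it again when the task is done

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    ctx.detach()                          # clean up the context at shutdown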

A sample for using CUDA with threads can be found here:

Please let me know if I misunderstood your question.
Thanks.

Hello AastaLLL,
thanks for the reply.

The reason we use a subprocess is that after each classification the resources are not freed; instead, more and more memory is allocated with every classification, which results in a kind of leak. There is only one process running the model at any time. There should be only one copy of the model preloaded on the GPU when the API process starts, and then an API function starts another process that allocates memory for the picture classification and uses the model on the GPU (the tensor with the picture is loaded onto the GPU as well, of course). After the classification is done, the resources have to be given back to the OS. Without multiprocessing, no resources are given back to the OS; more and more are allocated every time. After making the classification a subprocess, we are now facing the problem mentioned above.

Hi,

It sounds like some input or output buffer is not being released correctly.
Would you mind sharing your implementation of the inference part?
We cannot find the inference part in the sample shared above.
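For reference, a typical PyTorch inference pattern that avoids holding on to per-image buffers looks roughly like this (a sketch with placeholder names, not your actual code):

    import torch
    from skimage import io

    def classify(imgp, return_dict, net, transform):
        img = transform(io.imread(imgp)).unsqueeze(0).to('cuda')
        with torch.no_grad():                    # do not build an autograd graph
            output = net(img)
        return_dict['pred'] = int(output.argmax(1).cpu())  # keep only the result, on the CPU
        del img, output                          # drop the references to the GPU tensors
        torch.cuda.empty_cache()                 # hand cached blocks back to the driver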

Thanks.

Thanks, you were right. The problem is solved.