'cudaMemcpy() failed to allocate memory' in Threaded module

Hi, could I please have some of your wisdom to sort out my latest problem?

I am porting a program from cv2 to jetson.utils, prior to adding jetson.inference. Specifically, I am using jetson.utils.videoSource and jetson.utils.videoOutput.
I have stripped the program down to the attached version, which has only one input and one output but retains two of the threads; that is sufficient to bring on the error.
The error is:

  File "/home/jc/jcCode/gutted/car_controller_P6_0v081g.py", line 80, in get_camera_frame
    rg.G_CudaImageToUse = jetson.utils.cudaMemcpy(image)   # 0v81
Exception: jetson.utils -- cudaMemcpy() failed to allocate memory

My basic strategy is to:

  • allocate one global image for all the 'consumers' to use (rg.G_CudaImageToUse)
  • get_camera_frame()
    • gets an image into a local variable
    • copies this to the global area and marks it valid: rg.G_CudaImageToUse = jetson.utils.cudaMemcpy(image)
  • show_image_in_window()
    • runs at a different frequency
    • copies the image to a local variable: TempCudaImageToUse = jetson.utils.cudaMemcpy(rg.G_CudaImageToUse)
    • writes all over it and displays it
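The intent of the strategy above can be sketched in plain Python. This is a minimal stdlib analogue only (threading plus a pre-allocated bytearray standing in for the CUDA image; the names shared_frame, get_camera_frame, and show_image_in_window mirror the post but are hypothetical, not jetson.utils code):

```python
import threading

# Stand-ins for the cudaImage buffers: allocated once, then only copied into.
FRAME_SIZE = 16
shared_frame = bytearray(FRAME_SIZE)   # analogue of rg.G_CudaImageToUse
shared_frame_valid = False
lock = threading.Lock()

def get_camera_frame(raw):
    """Producer: copy a captured frame into the shared buffer in place."""
    global shared_frame_valid
    with lock:
        shared_frame[:] = raw          # copy into the existing buffer, no reassignment
        shared_frame_valid = True

def show_image_in_window():
    """Consumer: copy the shared buffer into a local one, then draw on it."""
    local = bytearray(FRAME_SIZE)      # analogue of TempCudaImageToUse
    with lock:
        if not shared_frame_valid:
            return None
        local[:] = shared_frame        # copy out under the lock
    local[0] = 0xFF                    # 'writes all over it' without touching the shared copy
    return local

get_camera_frame(bytes(range(FRAME_SIZE)))
frame = show_image_in_window()
```

The key design point (and, as it turns out below, the source of the bug) is whether each step copies into an existing buffer or rebinds the name to a freshly allocated one.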

In my simple thinking there should be only three 'allocate memory' steps: one for the global and one each for the local variables. Once underway, things should just copy.
In this version I have delayed the start of show_image_in_window(), and it seems to swim along OK until that starts; then it goes pear-shaped.
Am I doing something fundamentally wrong with this strategy? (It worked OK with cv2.)
Can I play with CUDA memory buffers this way?
Finally, is there a clue in 'cudaMemcpy() failed to allocate memory'? What is cudaMemcpy() doing trying to allocate memory, when it is supposed to be copying it?

Thanks in anticipation
cudaMemcpy failed to allocate memory.log (10.8 KB)
car_controller_P6_0v081g.py (10.8 KB)
rg.py (2.0 KB)


The error comes from the following line, which fails to allocate unified memory.

Have you run the same source without using threads?
If not, could you give it a try?


Thanks AastaLLL,
I have run many other, simpler, non-threaded routines using these commands, but it is not possible to run this one without the threads.
That's why I was asking: what is special about using CUDA-related things in a threaded environment?
For example, what is the scope of a variable created with cudaAllocMapped()? Does it behave like any other Python variable?
And what is cudaMemcpy() doing trying to allocate memory in the first place?

It only allocates it if you use the single-parameter version of jetson.utils.cudaMemcpy(), where the dst buffer isn’t provided by the user. In this case, the function allocates another buffer of the same size, and copies the src buffer to it. If you instead use it like cudaMemcpy(dst, src), it won’t do any allocation.
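The two call forms differ only in who supplies the destination buffer. A minimal stdlib sketch of that allocate-vs-copy-in-place distinction (a hypothetical memcpy helper illustrating the behaviour described above, not the actual jetson.utils implementation):

```python
def memcpy(dst_or_src, src=None):
    """Mimic the two jetson.utils.cudaMemcpy() call forms described above.

    memcpy(src)      -> allocates a fresh buffer and copies src into it
    memcpy(dst, src) -> copies src into the caller-supplied dst, no allocation
    """
    if src is None:
        src = dst_or_src
        dst = bytearray(len(src))      # single-arg form: a new allocation on every call
    else:
        dst = dst_or_src               # two-arg form: reuse the caller's buffer
    dst[:len(src)] = src
    return dst

src = bytes([1, 2, 3, 4])
copy1 = memcpy(src)                    # allocates a new buffer each time
dst = bytearray(4)
copy2 = memcpy(dst, src)               # no allocation; returns the same dst object
```

Called once per frame in a capture loop, the single-argument form produces a new buffer every iteration, which is why a per-frame allocation can eventually exhaust memory if the old buffers are not released.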

Are these Python “threads” running in different processes? CUDA memory and contexts aren’t shared across processes.

Dusty thanks
No, they are not in different processes.
I am currently using rg.G_CudaImageToUse = jetson.utils.cudaMemcpy(image); I will try jetson.utils.cudaMemcpy(rg.G_CudaImageToUse, image) instead.

That works much better! I stopped it at 5000 frames, but seeing as it used to crash between 30 and 100 or so, I think it's cured. It does sound like a deeper issue, though, where that (temporary) allocation is not being released cleanly.

Hmm, it may be that your board was running out of memory as it kept allocating frames. My guess is the Python garbage collector wasn't deleting them. You could try explicitly calling del rg.G_CudaImageToUse after you are done with it. Then again, you have it working now (and it's better to pre-allocate anyway), so just go with that. Glad you got it working!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.