Combining PyTorch and CUDA increases memory consumption when running in the same process

Hi everyone!

I have been facing weird memory-consumption behaviour when building a GStreamer element that combines PyTorch and CUDA.

First, I constructed an element that consumes NVMM buffers and performs some in-place CUDA computation on them. It was working like a charm: I was even able to run up to six parallel pipelines using the element without any issue beyond the expected GPU usage. However, I wanted to add some object detection functionality to the pipeline and tried to integrate PyTorch into the element. As soon as I link the Torch library to the element, the memory consumption increases hugely, and the pipelines consume ~1.2 GB each (it was less than 100 MB before).

The memory consumption pattern is like follows:

Without Torch linked during element compilation:


With Torch linked to the element at compile time:


The spike in memory consumption at the end of the trace occurs when the resources are released.

So far, the experiment is just adding Torch as a dependency in the build system. I am not adding any code (not even headers).

To reproduce the issue, I have prepared some toy code. It contains a couple of pass-through elements (torchpassthru, cudapassthru): one with Torch linked in (just that), and another one that uses initialisation/termination similar to the CUDA code we are using.

The issue can be reproduced by running:

gst-launch-1.0 videotestsrc ! "video/x-raw,width=640,height=480" ! cudapassthru ! torchpassthru ! fakesink silent=false -vvv

where the memory consumption only spikes when closing the pipeline, and:

gst-launch-1.0 videotestsrc ! "video/x-raw,width=640,height=480" ! torchpassthru ! cudapassthru ! fakesink silent=false -vvv

where the memory consumption increases from tens of MB to one GB. There is also a spike when closing the pipeline.

The memory scales linearly as more cudapassthru elements are added: one element increases the consumption by 1 GB, two increase it by 2 GB, and so on. This behaviour does not happen when the pipeline contains only cudapassthru elements; in that case, the memory stays on the order of tens of MB.

I also want to mention that this behaviour does not occur when the pipelines run in separate processes. For example:

# Terminal 1
gst-launch-1.0 videotestsrc ! "video/x-raw,width=640,height=480" ! torchpassthru !  fakesink silent=false -vvv

# Terminal 2
gst-launch-1.0 videotestsrc ! "video/x-raw,width=640,height=480" ! cudapassthru ! cudapassthru ! cudapassthru !  fakesink silent=false -vvv

I suspect that it can be something related to CUDA contexts. If so:

  1. Is there a way to make PyTorch and CUDA share the context in some way?

In the beginning, I thought it was a PyTorch issue. However, looking at how the memory scales when I add more elements, I started to suspect that there is something else.

  2. Any clue about why this behaviour can happen in CUDA?

I am using PyTorch 1.8 (from NVIDIA) on a Jetson TX2 with JP 4.5.1. The GStreamer version is the default one (1.14.x).

Thanks in advance for any clue you can provide about this issue.


My initial thought is that PyTorch depends on some heavy libraries that require a large amount of memory when loading.
Could you try a pipeline with only the PyTorch component and share the memory consumption status with us?


Hi @AastaLLL
Thanks for your response. To answer your question, I profiled the memory consumption and got the following results:


Thus, the memory impact appears when there is either:

  1. The two elements together in the same GStreamer pipeline
  2. Torch linked to the CUDA element (like merging the two elements in a single one)

What is really weird is that the memory consumption explodes under the above conditions: merely linking Torch to the CUDA element triggers the effect, and so does placing them in two separate elements running in the same pipeline.

Another thing I have found is that splitting the pipeline into two separate processes (one pipeline with Torch and another with CUDA) does not lead to this behaviour.


Sorry for the late update.

This issue looks weird to us as well.
We are going to reproduce it internally and will get back to you later.


Hi @AastaLLL

I have been digging into the issue these days and have overcome it by simply using the primary context provided by the driver. That fixed the growing memory consumption and kept it under control. Some effects remain:

  1. There is still a delay of roughly 15 seconds when synchronising the context with cudaFree(0).
  2. The memory consumption ramps up to 1.5 GB, but it stays approximately constant as I add new CUDA elements.

This decision is based on the CUDA Driver API note about contexts. Since GStreamer is not intended to be multi-process, I decided to follow a shared-context approach.

I am still puzzled by the 1.5 GB memory consumption; it seems quite related to PyTorch, though.

In a few words, I have replaced cuCtxCreate with cuCtxGetCurrent.
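For reference, a minimal sketch of that change, assuming the element previously created its own context with cuCtxCreate (the helper name attach_to_primary_context is illustrative, not from the toy code):

```c
#include <cuda.h>
#include <cuda_runtime.h>

/* Before: each element created a fresh context, so PyTorch (which
 * uses the runtime's primary context) and every CUDA element paid
 * their own per-context overhead.
 *
 *     CUcontext ctx;
 *     cuCtxCreate(&ctx, 0, device);   // one context per element
 *
 * After: attach to the primary context that the runtime API (and
 * hence PyTorch) uses, so everything in the process shares one
 * context.
 */
static CUcontext attach_to_primary_context(void)
{
    CUcontext ctx = NULL;

    cuInit(0);

    /* Force the runtime to initialise the primary context and make
     * it current on this thread; this is the call that takes ~15 s
     * on the Jetson TX2. */
    cudaFree(0);

    /* Grab the now-current primary context instead of creating a
     * new one. */
    cuCtxGetCurrent(&ctx);
    return ctx;
}
```

An equivalent driver-API-only route would be cuDevicePrimaryCtxRetain followed by cuCtxSetCurrent, paired with cuDevicePrimaryCtxRelease at teardown.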

Good to know you found a way to fix this.
Thanks for the feedback.