Nemo > Canary 1B > RuntimeError: CUDA driver error: out of memory

Hi! I’m running the following script on a Jetson AGX Orin 64GB dev kit via the NeMo Jetson container and I’m getting a “RuntimeError: CUDA driver error: out of memory”. Any ideas?

When I catch the exception and log memory usage (with torch.cuda.memory…), it’s nowhere near the 60GB available (nothing else is running). Monitoring with jtop also shows less than 15GB in use the whole time it’s running (so about 45GB free). And the audio file is only 5 seconds of mono 16-bit audio at 16kHz.
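For reference, this is roughly how I log the stats that appear after the trace below (a minimal sketch; I’m assuming the standard torch.cuda memory queries):

import torch

def log_cuda_memory():
    # Standard torch.cuda memory statistics, converted to MB.
    mb = 1024 ** 2
    print(f"Allocated memory: {torch.cuda.memory_allocated() / mb:.2f} MB")
    print(f"Reserved memory: {torch.cuda.memory_reserved() / mb:.2f} MB")
    print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / mb:.2f} MB")
    print(f"Max memory reserved: {torch.cuda.max_memory_reserved() / mb:.2f} MB")

# Called from the except RuntimeError handler wrapped around transcribe() below.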

# Load Canary model
from nemo.collections.asr.models import EncDecMultiTaskModel

canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')

# Transcribe
transcript = canary_model.transcribe(audio=["path_to_audio_file.wav"])

I’ve tried the following (sketched in code after this list):

  • Allocating an empty 32GB tensor in the container, which succeeds, so it is possible to allocate that amount of memory in the container.
  • Casting with canary_model = canary_model.half() and using a 0.1s audio file as input, which gets transcribe to complete and output a transcript with no OOM error.
  • Setting batch_size = 1, which doesn’t appear to help; I still get the OOM error.
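
Roughly what those three checks look like (a sketch; the short test clip’s file name is a placeholder):

import torch

# 1) Allocate ~32GB of float32 on the GPU to prove the container can do it.
probe = torch.empty(32 * 1024**3 // 4, dtype=torch.float32, device='cuda')
del probe
torch.cuda.empty_cache()

# 2) Half precision plus a 0.1s clip: transcribe completes with no OOM.
canary_model = canary_model.half()
transcript = canary_model.transcribe(audio=["tiny_0p1s_clip.wav"])

# 3) An explicit batch_size=1 doesn't change the outcome.
transcript = canary_model.transcribe(audio=["path_to_audio_file.wav"], batch_size=1)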

Full trace:

Traceback (most recent call last):
  File "/data/nemo/reserve.py", line 70, in <module>
    predicted_text = canary_model.transcribe(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/aed_multitask_models.py", line 502, in transcribe
    return super().transcribe(audio=audio, override_config=trcfg)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/mixins/transcription.py", line 263, in transcribe
    for processed_outputs in generator:
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/mixins/transcription.py", line 373, in transcribe_generator
    model_outputs = self._transcribe_forward(test_batch, transcribe_cfg)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/aed_multitask_models.py", line 842, in _transcribe_forward
    log_probs, encoded_len, enc_states, enc_mask = self.forward(input_signal=audio, input_signal_length=audio_lens)
  File "/usr/local/lib/python3.10/dist-packages/nemo/core/classes/common.py", line 1081, in wrapped_call
    outputs = wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/aed_multitask_models.py", line 646, in forward
    processed_signal, processed_signal_length = self.preprocessor(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/core/classes/common.py", line 1081, in wrapped_call
    outputs = wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/modules/audio_preprocessing.py", line 101, in forward
    processed_signal, processed_length = self.get_features(input_signal.to(torch.float32), length)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/modules/audio_preprocessing.py", line 301, in get_features
    return self.featurizer(input_signal, length)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/preprocessing/features.py", line 443, in forward
    x = torch.sqrt(x.pow(2).sum(-1) + guard)
RuntimeError: CUDA driver error: out of memory

Allocated memory: 3922.57 MB
Reserved memory: 3934.00 MB
Max memory allocated: 3926.39 MB
Max memory reserved: 3934.00 MB

Hi,

Please try increasing the memory amount as suggested in the comment below to see if it works.

Thanks

I may be missing something, but wouldn’t that restrict the memory amount rather than increase it?

Currently the NeMo container has 60GB of memory (plus 32GB of swap) available to it, and adding those parameters to docker would limit it to 500MB of memory and 7.5GB of swap.

I was able to transcribe a 30s audio file on the Hugging Face Canary Space without any issue, and that runs on a T4 with 16GB of RAM and 16GB of VRAM, i.e. half the memory resources my Orin dev kit has. Could this indicate that the out-of-memory (OOM) error I encountered is a misdiagnosis?

Hi,

Swap is not GPU-accessible memory.

Could you check the system status with tegrastats first?
The 64GB of memory is shared between the CPU and GPU, and some of it may be occupied by the CPU for the OS.
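
For example, a few samples can be captured while the script is running (a sketch, assuming tegrastats is on the PATH):

import itertools
import subprocess

# Stream tegrastats (one sample per second) and print the first few samples.
proc = subprocess.Popen(["tegrastats", "--interval", "1000"],
                        stdout=subprocess.PIPE, text=True)
try:
    for line in itertools.islice(proc.stdout, 5):
        print(line.rstrip())
finally:
    proc.terminate()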

Thanks.

In that case, could you clarify why you recommended following the instructions to “try adding --memory=500M --memory-swap=8G to your docker run script”?

I had checked the system status earlier and mentioned there was plenty of free memory available…

Hi,

When you run a PyTorch application, not all of the memory is used for GPU tasks.
So passing more memory to the container might help in some use cases.

Could you try to run the application without using the container?
This will help to figure out if this is related to the container resource access.

Also, could you concurrently monitor the device status to see whether the memory becomes fully occupied?

Thanks.

I concurrently monitored the device and it’s not fully occupied. Not even close.

I can try to run it outside the container.

One thing I already tried was allocating empty tensors adding up to 32GB of memory, and that worked fine within the container. That would be memory associated with GPU tasks, is that correct? I could try to allocate 32GB via a CPU task instead (see the sketch below); is that worth trying?
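
Something like this is what I have in mind (a sketch; sizes are approximate, and the CPU-side check is the part I haven’t run yet):

import torch

n_elems = 32 * 1024**3 // 4  # ~32GB of float32

# GPU-side allocation: this already succeeded inside the container.
gpu_buf = torch.empty(n_elems, dtype=torch.float32, device='cuda')
del gpu_buf
torch.cuda.empty_cache()

# CPU-side allocation: would this be a useful comparison?
cpu_buf = torch.empty(n_elems, dtype=torch.float32, device='cpu')
del cpu_buf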

Hi,

GPU memory can be allocated via cudaMalloc() directly.
Would you mind sharing your test code with us so we can try the same internally to gather more information?
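
If it helps, a direct cudaMalloc() check from Python could look something like this ctypes sketch (an illustration only; the exact libcudart library name may differ on your JetPack version):

import ctypes

# Load the CUDA runtime and request a 1 GiB device allocation directly.
cudart = ctypes.CDLL("libcudart.so")  # may need a versioned name, e.g. libcudart.so.12
ptr = ctypes.c_void_p()
err = cudart.cudaMalloc(ctypes.byref(ptr), ctypes.c_size_t(1 << 30))
print("cudaMalloc returned", err)  # 0 == cudaSuccess
if err == 0:
    cudart.cudaFree(ptr)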

Thanks.

Sure, it’s actually NVIDIA’s code that I was trying:

To be honest, I had to give up on the Jetson for now and go with a cloud-based approach to my problem. It was very time-consuming to get going in the first place (upgrading to JetPack 6 wasn’t straightforward), and I kept running into strange issues like the one in this thread that cost me days of dev time.
