Hi! I’m running the following script on a Jetson AGX Orin 64GB dev kit via the NeMo Jetson container and I’m getting a “RuntimeError: CUDA driver error: out of memory”. Any ideas?
When I catch the exception and log memory usage (with torch.cuda.memory…), it’s nowhere near the 60 GB available (nothing else is running). Monitoring with jtop also shows less than 15 GB in use the whole time it’s running (so roughly 45 GB free). And the audio file is only 5 seconds, mono, 16-bit @ 16 kHz.
# Load Canary model
from nemo.collections.asr.models import EncDecMultiTaskModel
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')
# Transcribe
transcript = canary_model.transcribe(audio=["path_to_audio_file.wav"])
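For reference, here’s roughly how I’m catching the exception and logging memory (a simplified sketch of my actual script, using the standard torch.cuda memory stats):

import torch

try:
    transcript = canary_model.transcribe(audio=["path_to_audio_file.wav"])
except RuntimeError:
    mb = 1024 ** 2
    print(f"Allocated memory: {torch.cuda.memory_allocated() / mb:.2f} MB")
    print(f"Reserved memory: {torch.cuda.memory_reserved() / mb:.2f} MB")
    print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / mb:.2f} MB")
    print(f"Max memory reserved: {torch.cuda.max_memory_reserved() / mb:.2f} MB")
    raise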
I’ve tried:
- Allocating an empty 32 GB tensor in the container, which succeeds, confirming it’s possible to allocate that much memory inside the container (see the sketch after this list).
- canary_model = canary_model.half() with a 0.1 s audio file as input: transcribe completes and outputs a transcript with no OOM error.
- batch_size = 1, which doesn’t appear to help; I still get the OOM error.
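Roughly what those checks looked like (a sketch, not my exact code; the short clip filename is a placeholder):

import torch
from nemo.collections.asr.models import EncDecMultiTaskModel

# 32 GB allocation probe: this succeeds inside the container.
probe = torch.empty(32 * 1024 ** 3, dtype=torch.uint8, device="cuda")
del probe
torch.cuda.empty_cache()

canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')
canary_model = canary_model.half()  # fp16 weights

# With a 0.1 s clip this completes and returns a transcript;
# with the 5 s clip it still OOMs, even at batch_size=1.
transcript = canary_model.transcribe(
    audio=["short_clip_0.1s.wav"],  # placeholder path
    batch_size=1,
)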
Full trace:
Traceback (most recent call last):
  File "/data/nemo/reserve.py", line 70, in <module>
    predicted_text = canary_model.transcribe(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/aed_multitask_models.py", line 502, in transcribe
    return super().transcribe(audio=audio, override_config=trcfg)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/mixins/transcription.py", line 263, in transcribe
    for processed_outputs in generator:
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/mixins/transcription.py", line 373, in transcribe_generator
    model_outputs = self._transcribe_forward(test_batch, transcribe_cfg)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/aed_multitask_models.py", line 842, in _transcribe_forward
    log_probs, encoded_len, enc_states, enc_mask = self.forward(input_signal=audio, input_signal_length=audio_lens)
  File "/usr/local/lib/python3.10/dist-packages/nemo/core/classes/common.py", line 1081, in wrapped_call
    outputs = wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/aed_multitask_models.py", line 646, in forward
    processed_signal, processed_signal_length = self.preprocessor(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/core/classes/common.py", line 1081, in wrapped_call
    outputs = wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/modules/audio_preprocessing.py", line 101, in forward
    processed_signal, processed_length = self.get_features(input_signal.to(torch.float32), length)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/modules/audio_preprocessing.py", line 301, in get_features
    return self.featurizer(input_signal, length)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/parts/preprocessing/features.py", line 443, in forward
    x = torch.sqrt(x.pow(2).sum(-1) + guard)
RuntimeError: CUDA driver error: out of memory
Allocated memory: 3922.57 MB
Reserved memory: 3934.00 MB
Max memory allocated: 3926.39 MB
Max memory reserved: 3934.00 MB