CUDA error in PyTorch Training when using nv-nsight-cu-cli

Hi, I tried to use nv-nsight-cu-cli to get detailed profiles for a PyTorch training process. The version of nv-nsight-cu-cli is 2019.4.0 and the CUDA version is 10.0. My command line is:

nv-nsight-cu-cli -o layerwise0 -f --csv --profile-from-start off /home/jxt/anaconda3/envs/pytorch/bin/python test.py

There is no problem when the Python code runs without nv-nsight-cu-cli. However, if I set --profile-from-start to “off”, a CUDA error occurs when computing “ReLU”:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "test.py", line 217, in training_process
    train(train_loader, r, epoch, batch_start_time)
  File "test.py", line 204, in train
    r.run_forward(inputs, labels, scale=scale, last_batch=last_batch)
  File "/home/jxt/test/runtime/runtime.py", line 227, in run_forward
    output_tensors = eachmodule.module()(eachmodule.input_tensors)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jxt/test/models/resnext101_32x16d/gpus=3/stage0.py", line 118, in forward
    out3 = self.layer3(out2)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/activation.py", line 94, in forward
    return F.relu(input, inplace=self.inplace)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 1061, in relu
    result = torch.relu(input)
RuntimeError: CUDA error: an illegal memory access was encountered

If I set --profile-from-start to “on”, there is another CUDA error:

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "test.py", line 217, in training_process
    train(train_loader, r, epoch, batch_start_time)
  File "test.py", line 204, in train
    r.run_forward(inputs, labels, scale=scale, last_batch=last_batch)
  File "/home/jxt/test/runtime/runtime.py", line 227, in run_forward
    output_tensors = eachmodule.module()(eachmodule.input_tensors)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jxt/test/models/resnext101_32x16d/gpus=3/stage0.py", line 118, in forward
    out3 = self.layer3(out2)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/activation.py", line 94, in forward
    return F.relu(input, inplace=self.inplace)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 1061, in relu
    result = torch.relu(input)
RuntimeError: CUDA error: an illegal memory access was encountered

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "test.py", line 155, in load_and_receive
    r.load_new_config(rank=args.rank, model=model, config=config)
  File "/home/jxt/test/runtime/runtime.py", line 393, in load_new_config
    is_first=is_first, is_last=is_last, inputs=inputs, outputs=outputs))
  File "/home/jxt/test/runtime/runtime.py", line 33, in __init__
    self._module.cuda()
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 307, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 225, in _apply
    param_applied = fn(param)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 307, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: an illegal memory access was encountered

Why does this happen, and how can I get the profile I want? Thank you!

From your description, the error could be caused by a known issue in Nsight Compute 2019.4 with PyTorch > 19.07, by a specific metric collected via SW-patching, or by an inherent problem in your application.

The first thing I would recommend is trying the latest Nsight Compute 2020.1, in which we resolved known issues with newer PyTorch versions: https://docs.nvidia.com/nsight-compute/ReleaseNotes/index.html#updates-2020-1

ncu -o layerwise0 -f --csv --profile-from-start off /home/jxt/anaconda3/envs/pytorch/bin/python test.py

Note that this version of the tool no longer collects the “full” set of curated metrics by default, as 2019.4 did. Instead, you can select and combine metrics, sections or section sets as needed. To collect the full curated set, pass the “--set full” option.

ncu -o layerwise0 -f --csv --profile-from-start off --set full /home/jxt/anaconda3/envs/pytorch/bin/python test.py

If you still see issues with this newer version of the tool, I recommend selecting fewer metrics/sections to see whether a particular one is causing the illegal memory access, e.g. try the following commands:

ncu -o layerwise0 -f --csv --profile-from-start off --metrics gpc__cycles_elapsed.max /home/jxt/anaconda3/envs/pytorch/bin/python test.py
ncu -o layerwise0 -f --csv --profile-from-start off --section SpeedOfLight /home/jxt/anaconda3/envs/pytorch/bin/python test.py
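If there are many candidates to try, the one-at-a-time approach above can be scripted. The following is only a minimal Python sketch: it assumes `ncu` is on the PATH, the helper name `bisect_commands` and the per-run output names `layerwise0_<n>` are made up for illustration, and it uses only the flags already shown in this thread.

```python
import shlex

# Flags copied from the commands in this thread.
NCU_BASE = ["ncu", "-f", "--csv", "--profile-from-start", "off"]
APP = ["/home/jxt/anaconda3/envs/pytorch/bin/python", "test.py"]


def bisect_commands(metrics=(), sections=()):
    """Return one ncu argv list per metric or section, so each runs in isolation."""
    candidates = [("--metrics", m) for m in metrics]
    candidates += [("--section", s) for s in sections]
    commands = []
    for index, (flag, value) in enumerate(candidates):
        # Hypothetical naming scheme: a separate report file per candidate.
        output = f"layerwise0_{index}"
        commands.append(NCU_BASE + ["-o", output, flag, value] + APP)
    return commands


if __name__ == "__main__":
    for argv in bisect_commands(["gpc__cycles_elapsed.max"], ["SpeedOfLight"]):
        # Print shell-ready lines; run them one by one and note which one fails.
        print(" ".join(shlex.quote(arg) for arg in argv))
```

The first candidate whose run reproduces the illegal memory access is the one worth reporting.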

To rule out inherent memory access issues in your application that only manifest when running under the profiler, run the application through the memcheck and racecheck tools of cuda-memcheck or compute-sanitizer.
https://docs.nvidia.com/cuda/compute-sanitizer/index.html
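As a rough illustration of that check, here is a small Python sketch that runs the application once per tool. The helper names `sanitizer_command` and `run_checks` are hypothetical; the sketch assumes `compute-sanitizer` is on the PATH (with an older toolkit such as CUDA 10.0, pass `binary="cuda-memcheck"` instead) and relies only on the `--tool` option that both binaries accept.

```python
import subprocess

APP = ["/home/jxt/anaconda3/envs/pytorch/bin/python", "test.py"]
TOOLS = ("memcheck", "racecheck")


def sanitizer_command(tool, app=APP, binary="compute-sanitizer"):
    """Build the argv for one sanitizer tool; binary may also be 'cuda-memcheck'."""
    if tool not in TOOLS:
        raise ValueError(f"unsupported tool: {tool}")
    return [binary, "--tool", tool] + list(app)


def run_checks(runner=subprocess.run, binary="compute-sanitizer"):
    """Run the application under each tool and collect the process return codes."""
    results = {}
    for tool in TOOLS:
        completed = runner(sanitizer_command(tool, binary=binary))
        results[tool] = completed.returncode
    return results
```

Calling `run_checks()` launches the training script twice, once per tool. The return codes are only a first hint; the report each tool prints to the console is what shows whether an invalid access or a race was detected.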

Finally, please make sure to use a recent driver if possible.

Thank you so much for your advice! I have updated Nsight Compute to the latest version and there is no CUDA error now. However, another error occurred:

==ERROR== Error: ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM

I do not have sudo permission or the CAP_SYS_ADMIN capability. Is there any other way to avoid this problem? Thank you.

These added security restrictions were introduced by newer driver versions. You can refer to https://developer.nvidia.com/ERR_NVGPUCTRPERM for all the details.

If you upgraded the driver as part of moving to Nsight Compute 2020.1, you should also have the permissions to enable performance counter collection for non-root users, as described on the page.

If you didn’t move to a newer driver, then this restriction already existed beforehand (and was likely what caused the crash), but the older Nsight Compute was not yet able to report it properly. In this case, you will need to check with the system admin to get those permissions.
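Before contacting the admin, it may help to confirm whether the restriction is actually active. The sketch below is an assumption-laden example, not an official check: it assumes a Linux driver that exposes `/proc/driver/nvidia/params` with a `RmProfilingAdminOnly` entry, and the function name `profiling_admin_only` is made up for illustration.

```python
from pathlib import Path

# Assumed location of the driver parameter dump on Linux.
PARAMS_PATH = Path("/proc/driver/nvidia/params")


def profiling_admin_only(params_text):
    """Return True/False for the RmProfilingAdminOnly entry, or None if it is absent."""
    for line in params_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "RmProfilingAdminOnly":
            return value.strip() == "1"
    return None


if __name__ == "__main__":
    if PARAMS_PATH.exists():
        state = profiling_admin_only(PARAMS_PATH.read_text())
        print("RmProfilingAdminOnly:", state)
    else:
        print(f"{PARAMS_PATH} not found; cannot determine the setting")
```

A result of `True` would mean performance counters are restricted to admin users, in which case only the admin can lift the restriction as described on the ERR_NVGPUCTRPERM page.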

Thank you for your reply. In my case it is not possible for me to check with the admin to get those permissions. I think it would be more convenient for users if these permissions were not required. Again, thank you so much for your help!