Hi, I tried to use nv-nsight-cu-cli to get a detailed profile of a PyTorch training process. The nv-nsight-cu-cli version is 2019.4.0 and the CUDA version is 10.0. My command line is:
nv-nsight-cu-cli -o layerwise0 -f --csv --profile-from-start off /home/jxt/anaconda3/envs/pytorch/bin/python test.py
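For context, with --profile-from-start off the tool does not collect anything until the application itself calls cudaProfilerStart, and it stops again at cudaProfilerStop. Below is a minimal sketch of how a region of the training loop could be bracketed that way from Python, assuming the torch.cuda.profiler start/stop wrappers. The helper name profiled_region and its enabled flag are hypothetical, for illustration only, and not taken from test.py.

```python
import contextlib


@contextlib.contextmanager
def profiled_region(enabled=True):
    """Bracket a code region with cudaProfilerStart/cudaProfilerStop.

    Hypothetical helper for illustration; relies on torch.cuda.profiler.
    """
    if not enabled:
        # Profiling disabled: run the body unchanged, without touching CUDA.
        yield
        return

    import torch  # imported lazily so the helper can be defined without a GPU

    torch.cuda.synchronize()       # finish earlier work before collection starts
    torch.cuda.profiler.start()    # wraps cudaProfilerStart
    try:
        yield
    finally:
        torch.cuda.synchronize()   # make sure the region's kernels have completed
        torch.cuda.profiler.stop() # wraps cudaProfilerStop
```

It would be used around the iterations of interest, for example `with profiled_region(): r.run_forward(inputs, labels, scale=scale, last_batch=last_batch)`. In a multi-threaded script, which thread issues the start/stop calls may matter, so this is only a starting point.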
The Python code runs without problems when it is not launched under nv-nsight-cu-cli. However, if I set --profile-from-start to "off", a CUDA error is raised when computing ReLU:
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "test.py", line 217, in training_process
    train(train_loader, r, epoch, batch_start_time)
  File "test.py", line 204, in train
    r.run_forward(inputs, labels, scale=scale, last_batch=last_batch)
  File "/home/jxt/test/runtime/runtime.py", line 227, in run_forward
    output_tensors = eachmodule.module()(eachmodule.input_tensors)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jxt/test/models/resnext101_32x16d/gpus=3/stage0.py", line 118, in forward
    out3 = self.layer3(out2)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/activation.py", line 94, in forward
    return F.relu(input, inplace=self.inplace)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 1061, in relu
    result = torch.relu(input)
RuntimeError: CUDA error: an illegal memory access was encountered
If I set --profile-from-start to "on", I get CUDA errors in two threads:
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "test.py", line 217, in training_process
    train(train_loader, r, epoch, batch_start_time)
  File "test.py", line 204, in train
    r.run_forward(inputs, labels, scale=scale, last_batch=last_batch)
  File "/home/jxt/test/runtime/runtime.py", line 227, in run_forward
    output_tensors = eachmodule.module()(eachmodule.input_tensors)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jxt/test/models/resnext101_32x16d/gpus=3/stage0.py", line 118, in forward
    out3 = self.layer3(out2)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/activation.py", line 94, in forward
    return F.relu(input, inplace=self.inplace)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 1061, in relu
    result = torch.relu(input)
RuntimeError: CUDA error: an illegal memory access was encountered
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "test.py", line 155, in load_and_receive
    r.load_new_config(rank=args.rank, model=model, config=config)
  File "/home/jxt/test/runtime/runtime.py", line 393, in load_new_config
    is_first=is_first, is_last=is_last, inputs=inputs, outputs=outputs))
  File "/home/jxt/test/runtime/runtime.py", line 33, in __init__
    self._module.cuda()
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 307, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 225, in _apply
    param_applied = fn(param)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 307, in <lambda>
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: an illegal memory access was encountered
Why does this happen, and how can I get the profile I want? Thank you!