illegal memory access during nsight compute profiling

Background: I have tested NVIDIA’s WaveGlow, following the steps under “Generate audio with our pre-existing model”.

I have tested it successfully on its own:

python -f <(ls mel_spectrograms/*.pt) -w -o . --is_fp16 -s 0.6

and have run it with the Nsight Systems profiler:

nsys profile python -f mel_spectrograms/ -w -o . --is_fp16 -s 0.6

They both ran successfully, and the results of Nsight Systems looked fine.

However, when I run with Nsight Compute:

nv-nsight-cu-cli -f path/to/python -f <(ls mel_spectrograms/*.pt) -w -o . --is_fp16 -s 0.6

I get:

==PROF== Profiling -    1: 0%....50%....100%
Traceback (most recent call last):
  File "", line 84, in <module>
    args.sampling_rate, args.is_fp16, args.denoiser_strength)
  File "", line 38, in main
    waveglow = waveglow.remove_weightnorm(waveglow)
  File "/home/msl/isaac/waveglow/", line 299, in remove_weightnorm
    WN.in_layers = remove(WN.in_layers)
  File "/home/msl/isaac/waveglow/", line 308, in remove
    old_conv = torch.nn.utils.remove_weight_norm(old_conv)
  File "/home/msl/.virtualenvs/venv_waveglow/lib/python3.5/site-packages/torch/nn/utils/", line 113, in remove_weight_norm
  File "/home/msl/.virtualenvs/venv_waveglow/lib/python3.5/site-packages/torch/nn/utils/", line 48, in remove
    weight = self.compute_weight(module)
  File "/home/msl/.virtualenvs/venv_waveglow/lib/python3.5/site-packages/torch/nn/utils/", line 18, in compute_weight
    return _weight_norm(v, g, self.dim)
RuntimeError: CUDA error: an illegal memory access was encountered
==PROF== Report: profile.nsight-cuprof-report
weight_norm_fwd_first_dim_kernel, 2019-Apr-10 17:47:31
Section: GPU Speed Of Light
---------------------------------------------------------------------- --------------- ------------------------------
Memory Frequency                                                                   Ghz                           6.47
SOL FB                                                                               %                           0.65
Elapsed Cycles                                                                   cycle                      11,676.75
SM Frequency                                                                       Ghz                           1.79
Memory [%]                                                                           %                           8.84
Duration                                                                       usecond                           6.53
SOL L2                                                                               %                           1.07
SOL TEX                                                                              %                           1.97
SM [%]                                                                               %                          18.25
---------------------------------------------------------------------- --------------- ------------------------------

Section: Compute Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Executed Ipc Active                                                         inst/cycle                           1.23
Executed Ipc Elapsed                                                        inst/cycle                           0.78
Issued Ipc Active                                                           inst/cycle                           1.25
Issue Slots Busy                                                                     %                          20.87
SM Busy                                                                              %                          18.25
---------------------------------------------------------------------- --------------- ------------------------------

Section: Memory Workload Analysis
---------------------------------------------------------------------- --------------- ------------------------------
Memory Throughput                                                         Gbyte/second                           1.26
Mem Busy                                                                             %                           8.84
Max Bandwidth                                                                        %                           6.41
L2 Hit Rate                                                                          %                          86.20
Mem Pipes Busy                                                                       %                          22.18
L1 Hit Rate                                                                          %                          71.99
---------------------------------------------------------------------- --------------- ------------------------------

Section: Scheduler Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Active Warps Per Scheduler                                                  warp/cycle                          12.07
Eligible Warps Per Scheduler                                                warp/cycle                           0.64
No Eligible                                                                          %                          60.10
Instructions Per Active Issue Slot                                          inst/issue                           1.09
Issued Warp Per Scheduler                                                  issue/cycle                           0.43
One or More Eligible                                                                 %                          42.74
---------------------------------------------------------------------- --------------- ------------------------------

Section: Warp State Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Avg. Not Predicated Off Threads Per Warp                                   thread/inst                          25.94
Avg. Active Threads Per Warp                                               thread/inst                          30.93
Warp Cycles Per Executed Instruction                                        cycle/inst                          26.11
Warp Cycles Per Issued Instruction                                          cycle/inst                          25.54
Warp Cycles Per Issue Active                                               cycle/issue                          27.72
---------------------------------------------------------------------- --------------- ------------------------------

Section: Instruction Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Avg. Executed Instructions Per Scheduler                                          inst                       2,262.40
Executed Instructions                                                             inst                        180,992
Avg. Issued Instructions Per Scheduler                                            inst                       2,312.15
Issued Instructions                                                               inst                        184,972
---------------------------------------------------------------------- --------------- ------------------------------

Section: Launch Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Block Size                                                                                                        256
Grid Size                                                                                                         256
Registers Per Thread                                                   register/thread                             13
Shared Memory Configuration Size                                                 Kbyte                             48
Dynamic Shared Memory Per Block                                            Kbyte/block                              1
Static Shared Memory Per Block                                              byte/block                              0
Threads                                                                         thread                         65,536
Waves Per SM                                                                                                     1.60
---------------------------------------------------------------------- --------------- ------------------------------

Section: Occupancy
---------------------------------------------------------------------- --------------- ------------------------------
Block Limit SM                                                                   block                             32
Block Limit Registers                                                         register                             16
Block Limit Local Mem                                                             byte                             96
Block Limit Warps                                                                 warp                              8
Achieved Active Warps Per SM                                                warp/cycle                          53.04
Achieved Occupancy                                                                   %                          82.88
Theoretical Active Warps per SM                                             warp/cycle                             64
Theoretical Occupancy                                                                %                            100
---------------------------------------------------------------------- --------------- ------------------------------

and the program terminates due to the illegal memory access.

Thank you, I have logged a bug internally to track this. We will get back to you once we have been able to reproduce it and either have, or need, more information.

A very similar thing occurred when I ran Tacotron2 with nv-nsight-cu-cli a few hours ago. It ran perfectly fine by itself (python) and with Nsight Systems (nsys profile python), but it failed with an error when run with Nsight Compute (nv-nsight-cu-cli python). I wonder if I just have some local setting incorrect.

The error was:

Traceback (most recent call last):
  File "", line 23, in <module>
    mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)
  File "/home/msl/isaac/tacotron2/", line 531, in inference
    encoder_outputs = self.encoder.inference(embedded_inputs)
  File "/home/msl/isaac/tacotron2/", line 200, in inference
    outputs, _ = self.lstm(x)
  File "/home/msl/.virtualenvs/venv_waveglow/lib/python3.5/site-packages/torch/nn/modules/", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/msl/.virtualenvs/venv_waveglow/lib/python3.5/site-packages/torch/nn/modules/", line 179, in forward
    self.dropout,, self.bidirectional, self.batch_first)
==PROF== Profiling -    1: 0%....50%....100%
==PROF== Profiling -    2: 0%....50%....100%
==PROF== Profiling -    3: 0%....50%....100%
==PROF== Profiling -    4: 0%....50%....100%
==PROF== Profiling -    5: 0%....50%....100%
==PROF== Profiling -    6: 0%....50%....100%
==PROF== Profiling -    7: 0%....50%....100%
==PROF== Profiling -    8: 0%....50%....100%
==PROF== Profiling -    9: 0%....50%....100%
==PROF== Profiling -   10: 0%....50%....100%
==PROF== Profiling -   11: 0%....50%....100%
==PROF== Profiling -   12: 0%....50%....100%
==PROF== Profiling -   13: 0%....50%....100%
==PROF== Profiling -   14: 0%....50%....100%
==PROF== Profiling -   15: 0%....50%....100%
==PROF== Profiling -   16: 0%....50%....100%
==PROF== Profiling -   17: 0%....50%....100%
==PROF== Profiling -   18: 0%....50%....100%
==PROF== Profiling -   19: 0%....50%....100%
==PROF== Report: profile.nsight-cuprof-report
indexSelectLargeIndex, 2019-Apr-10 18:13:36

I didn’t include the profiling output that it printed, because it was too long.

Hi Isaac,

I am able to reproduce the issue on my side on Ubuntu 16.04, a Pascal GTX1070 and Nsight Compute 2019.1.
While we are looking into the issue, could you please let me know your versions (OS, GPU, Nsight Compute) so we can see which products are affected?

Hi Felix,

OS: CentOS7
GPU: GTX1080 x 2
CUDA version: 10.0
Driver version: 410.79
Nsight Compute version: 2019.1.1 (Build 25827221)
Python version: 3.5
PyTorch version: 0.4.0

It’s great to hear that you were able to reproduce this issue. I wonder if the scope of the bug is all of Python or just PyTorch.

Please let me know if you need anything else from me.

I wonder if the scope of the bug is all of Python or just PyTorch.

The bug is specific to PyTorch. It is an issue with patching certain kernels for profiling when collecting the sass__* metrics, e.g. sass__warp_histogram.

Until the bug is fixed, you should be able to work around it by not collecting those metrics during profiling. You can do that by not collecting the “LaunchStats” section. In the UI, simply disable the section. On the command line, either select a different subset of sections using the "--section" command-line flag, or e.g. remove those metrics from the .section file.

I have not tested it, but you might need to do the same for sass__inst_executed_per_opcode, which is specified in the InstructionStats section.
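For reference, a command-line sketch of this workaround might look like the following. The section identifiers and the script name (`your_inference_script.py`) are assumptions, not taken from this thread; list the identifiers your installation actually ships with via `--list-sections` before relying on them.

```shell
# List the section identifiers available in this Nsight Compute version:
nv-nsight-cu-cli --list-sections

# Profile while skipping the sections that collect the sass__* metrics
# (LaunchStats and, possibly, InstructionStats), by naming only the rest:
nv-nsight-cu-cli \
    --section SpeedOfLight \
    --section ComputeWorkloadAnalysis \
    --section MemoryWorkloadAnalysis \
    --section SchedulerStats \
    --section WarpStateStats \
    --section Occupancy \
    -o profile \
    python your_inference_script.py
```

The same effect can be had by editing the installed .section files, but selecting sections on the command line avoids modifying the installation.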

Hi Felix,

Thanks for the workaround. I haven’t yet worked with the Nsight Compute CLI to that extent, but I will give it a try.

Please let me know when it is fixed because this would hugely help my work routine.

We have reproduced this issue and have a fix that will make it into the next release of Nsight Compute.


Do you have an idea approximately when the next release will be?


We do not disclose release dates here.

Okay, I will follow the above instructions to run profiling then. Thanks!
