Low PyTorch performance on WSL2 with 460.x drivers and CUDA 11 in Docker (PyTorch bottleneck profile included!)

I am fine-tuning the Tacotron model (an Nvidia model, no less!) in WSL using the just-released Nvidia 460.x drivers and CUDA 11. I’ve loaded all the data into WSL, so nothing is being read from my Windows drives.

PyTorch and CUDA report that the GPU is available and being used. That said, Task Manager shows 0% GPU utilization.
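Whether or not Task Manager's graph reflects CUDA work under WSL2, a direct probe from inside the container can confirm that kernels really run on the device. The following is a minimal sketch, not part of my training script: `probe_gpu` and its parameters are names made up for illustration, and it returns `None` when PyTorch or a CUDA device is not available.

```python
import time

try:
    import torch
except ImportError:  # keeps the sketch importable where PyTorch is not installed
    torch = None


def probe_gpu(size=2048, repeats=10):
    """Time a few matmuls on the GPU; return None if no CUDA device is usable."""
    if torch is None or not torch.cuda.is_available():
        return None
    device = torch.device("cuda")
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    torch.cuda.synchronize()  # finish allocation and initialisation first
    start = time.perf_counter()
    for _ in range(repeats):
        c = a @ b
    torch.cuda.synchronize()  # kernels are asynchronous; wait before stopping the clock
    elapsed = time.perf_counter() - start
    return {
        "device": torch.cuda.get_device_name(0),
        "seconds_per_matmul": elapsed / repeats,
        "result_device": str(c.device),
    }
```

Calling `print(probe_gpu())` should show the device name and a per-matmul time; comparing that time between WSL2 and native Ubuntu would indicate whether raw compute or something around it (transfers, kernel launches) is slow.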

Normally, on native bare-metal Ubuntu, I get about 2 seconds per iteration on my RTX 2080, but here I am getting about 24 seconds per iteration (16 now that I have enabled fp16, but that is still a huge slowdown: the difference between 1 day and 1 week of training).
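To put numbers on that, here is the arithmetic as a short sketch. The helper names are made up for illustration, and the "1 day" baseline is my rough estimate rather than a measured run.

```python
def slowdown(baseline_s, observed_s):
    """How many times slower an observed iteration is than the baseline."""
    return observed_s / baseline_s


def scaled_duration_days(baseline_days, baseline_s, observed_s):
    """Training time in days if every iteration slows down by the same factor."""
    return baseline_days * slowdown(baseline_s, observed_s)


print(slowdown(2.0, 24.0))                    # 12.0 (fp32 on WSL2 vs native)
print(slowdown(2.0, 16.0))                    # 8.0  (fp16 on WSL2 vs native)
print(scaled_duration_days(1.0, 2.0, 16.0))   # 8.0  (about a week instead of a day)
```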

Here is the dump from PyTorch’s bottleneck profiler. The time seems to be all in backpropagation, so maybe some hardware bottleneck exists.
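For anyone who wants to reproduce a report like this, it comes from the `torch.utils.bottleneck` module, which is run as `python -m torch.utils.bottleneck script.py [args]`. A small sketch of building that command from Python follows; `train.py` and its arguments are placeholders, not my exact invocation.

```python
import sys


def bottleneck_command(script, script_args=()):
    """Build the argv list that runs a script under torch.utils.bottleneck."""
    return [sys.executable, "-m", "torch.utils.bottleneck", script, *script_args]


# Placeholder script name and arguments, for illustration only.
cmd = bottleneck_command("train.py", ["--output_directory", "out"])
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. The tool runs the script once under cProfile and again under the autograd profiler, so it helps to limit the script to a handful of iterations first.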

PyTorch Bottleneck Profiling

--------------------------------------------------------------------------------
  Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.6.0 compiled w/ CUDA 10.2
Running with Python 3.7 and CUDA 10.2.89

`pip list` truncated output:
numpy==1.18.1
torch==1.6.0
torchvision==0.6.1
--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         3654582 function calls (3478563 primitive calls) in 21.693 seconds

   Ordered by: internal time
   List reduced from 15174 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    8.180    8.180    8.180    8.180 {method 'run_backward' of 'torch._C._EngineBase' objects}
       89    3.847    0.043    3.847    0.043 {method 'cuda' of 'torch._C._TensorBase' objects}
     3441    1.389    0.000    1.389    0.000 {built-in method cat}
     1720    0.766    0.000    0.766    0.000 {built-in method lstm_cell}
      860    0.704    0.001    1.696    0.002 /home/jovyan/work/tacotron2/model.py:43(get_alignment_energies)
     2583    0.639    0.000    0.639    0.000 {method 'matmul' of 'torch._C._TensorBase' objects}
     1720    0.598    0.000    0.598    0.000 {built-in method addmm}
        2    0.487    0.244    0.487    0.244 /opt/conda/lib/python3.7/site-packages/numpy/linalg/linalg.py:1458(svd)
     3794    0.352    0.000    0.352    0.000 {built-in method marshal.loads}
        2    0.264    0.132    0.264    0.132 {method 'poll' of 'select.poll' objects}
     1730    0.190    0.000    0.190    0.000 {built-in method dropout}
       61    0.160    0.003    0.160    0.003 {method 'uniform_' of 'torch._C._TensorBase' objects}
    19579    0.152    0.000    0.152    0.000 {built-in method posix.stat}
        1    0.150    0.150    0.150    0.150 {built-in method lstm}
      868    0.150    0.000    0.150    0.000 {built-in method conv1d}


--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------  
Name                                Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes                                   
----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------  
CudnnRnnBackward                    18.07%           156.983ms        18.07%           156.983ms        156.983ms        NaN              0.000us          0.000us          1                []                                             
_cudnn_rnn_backward                 18.07%           156.956ms        18.07%           156.956ms        156.956ms        NaN              0.000us          0.000us          1                []                                             
lstm                                15.69%           136.274ms        15.69%           136.274ms        136.274ms        NaN              0.000us          0.000us          1                []                                             
_cudnn_rnn                          15.68%           136.226ms        15.68%           136.226ms        136.226ms        NaN              0.000us          0.000us          1                []                                             
t                                   4.68%            40.689ms         4.68%            40.689ms         40.689ms         NaN              0.000us          0.000us          1                []                                             
AddBackward0                        4.64%            40.334ms         4.64%            40.334ms         40.334ms         NaN              0.000us          0.000us          1                []                                             
uniform_                            3.94%            34.248ms         3.94%            34.248ms         34.248ms         NaN              0.000us          0.000us          1                []                                             
add                                 2.64%            22.934ms         2.64%            22.934ms         22.934ms         NaN              0.000us          0.000us          1                []                                             
MmBackward                          2.63%            22.861ms         2.63%            22.861ms         22.861ms         NaN              0.000us          0.000us          1                []                                             
t                                   2.60%            22.556ms         2.60%            22.556ms         22.556ms         NaN              0.000us          0.000us          1                []                                             
transpose                           2.60%            22.547ms         2.60%            22.547ms         22.547ms         NaN              0.000us          0.000us          1                []                                             
as_strided                          2.56%            22.256ms         2.56%            22.256ms         22.256ms         NaN              0.000us          0.000us          1                []                                             
uniform_                            2.19%            19.002ms         2.19%            19.002ms         19.002ms         NaN              0.000us          0.000us          1                []                                             
PackPaddedSequenceBackward          2.01%            17.458ms         2.01%            17.458ms         17.458ms         NaN              0.000us          0.000us          1                []                                             
_pack_padded_sequence_backward      2.01%            17.452ms         2.01%            17.452ms         17.452ms         NaN              0.000us          0.000us          1                []                                             
----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------  
Self CPU time total: 868.775ms
CUDA time total: 0.000us

--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

        Because the autograd profiler uses the CUDA event API,
        the CUDA time column reports approximately max(cuda_time, cpu_time).
        Please ignore this output if your code does not use CUDA.

----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------  
Name                                Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes                                   
----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------  
stack                               15.01%           2.345s           15.01%           2.345s           2.345s           15.01%           2.345s           2.345s           1                []                                             
stack                               13.53%           2.113s           13.53%           2.113s           2.113s           13.53%           2.113s           2.113s           1                []                                             
stack                               12.94%           2.020s           12.94%           2.020s           2.020s           12.93%           2.020s           2.020s           1                []                                             
cat                                 8.11%            1.266s           8.11%            1.266s           1.266s           8.11%            1.266s           1.266s           1                []                                             
_cat                                8.10%            1.266s           8.10%            1.266s           1.266s           8.11%            1.266s           1.266s           1                []                                             
cat                                 8.03%            1.254s           8.03%            1.254s           1.254s           8.03%            1.254s           1.254s           1                []                                             
_cat                                8.03%            1.254s           8.03%            1.254s           1.254s           8.03%            1.254s           1.254s           1                []                                             
cat                                 5.42%            846.588ms        5.42%            846.588ms        846.588ms        5.42%            846.696ms        846.696ms        1                []                                             
_cat                                5.42%            846.410ms        5.42%            846.410ms        846.410ms        5.42%            846.608ms        846.608ms        1                []                                             
StackBackward                       2.76%            430.782ms        2.76%            430.782ms        430.782ms        2.76%            430.784ms        430.784ms        1                []                                             
unbind                              2.75%            429.524ms        2.75%            429.524ms        429.524ms        2.75%            429.512ms        429.512ms        1                []                                             
StackBackward                       2.70%            422.450ms        2.70%            422.450ms        422.450ms        2.70%            422.448ms        422.448ms        1                []                                             
unbind                              2.70%            421.451ms        2.70%            421.451ms        421.451ms        2.70%            421.440ms        421.440ms        1                []                                             
PackPaddedSequenceBackward          2.25%            352.125ms        2.25%            352.125ms        352.125ms        2.25%            352.112ms        352.112ms        1                []                                             
_pack_padded_sequence_backward      2.25%            351.999ms        2.25%            351.999ms        351.999ms        2.25%            352.000ms        352.000ms        1                []                                             
----------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------------------------------------  
Self CPU time total: 15.619s
CUDA time total: 15.620s

This seems to imply that the backward calls are the bulk of it: run_backward alone accounts for 8.18 s of the 21.69 s in the cProfile output.
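One caveat when reading these tables: CUDA kernels launch asynchronously, so host-side timers tend to charge GPU time to whichever call happens to wait on the device. A cross-check is to time the forward and backward passes separately with explicit synchronization. The sketch below does this on a small stand-in LSTM, not on Tacotron itself; the function name and tensor sizes are arbitrary choices for illustration, and it falls back to the CPU when no CUDA device is present.

```python
import time

try:
    import torch
except ImportError:  # keeps the sketch importable where PyTorch is not installed
    torch = None


def _sync(device):
    """Wait for queued GPU work so wall-clock timings are meaningful."""
    if device.type == "cuda":
        torch.cuda.synchronize()


def time_forward_backward(steps=3, batch=8, seq_len=50, features=32, hidden=64):
    """Average forward and backward time per step for a small stand-in LSTM."""
    if torch is None:
        return None
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.nn.LSTM(features, hidden, batch_first=True).to(device)
    x = torch.randn(batch, seq_len, features, device=device)
    forward_s = backward_s = 0.0
    for step in range(steps):
        model.zero_grad()
        _sync(device)
        t0 = time.perf_counter()
        out, _state = model(x)
        loss = out.pow(2).mean()
        _sync(device)  # forward kernels are done here
        t1 = time.perf_counter()
        loss.backward()
        _sync(device)  # backward kernels are done here
        t2 = time.perf_counter()
        forward_s += t1 - t0
        backward_s += t2 - t1
    return {
        "device": device.type,
        "forward_s": forward_s / steps,
        "backward_s": backward_s / steps,
    }
```

If the same split measured on native Ubuntu and on WSL2 shows both phases slowing down by a similar factor, the problem is more likely general GPU overhead in this setup than anything specific to backpropagation.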