I am fine-tuning the Tacotron 2 model (an NVIDIA model, no less!) in WSL, using the just-released NVIDIA 460.x drivers and CUDA 11. I have loaded all of the data into WSL, so nothing is being read from my Windows drives.
PyTorch and CUDA report that the GPU is available and in use. Even so, Task Manager shows 0% GPU utilization.
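For reference, this is the kind of check I'm going by (a minimal sketch; the `nn.Linear` is just a stand-in for the actual Tacotron 2 model):

```python
import torch
import torch.nn as nn

# PyTorch sees the GPU under WSL.
print(torch.cuda.is_available())       # True
print(torch.cuda.get_device_name(0))   # e.g. "GeForce RTX 2080"

# Dummy module standing in for the Tacotron 2 model: after .cuda(),
# its parameters report that they live on the GPU.
model = nn.Linear(4, 4).cuda()
print(next(model.parameters()).device)  # cuda:0
```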
Normally, on bare-metal Ubuntu, I get about 2 seconds per iteration on my RTX 2080, but here I am getting about 24 seconds per iteration (16 now that I have enabled fp16). That is still a huge slowdown: the difference between one day and one week of training.
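For context, "fp16" here means mixed-precision training. Below is a minimal sketch of the general pattern using PyTorch 1.6's native `torch.cuda.amp`; the Tacotron 2 repo wires fp16 up through its own training script, so the toy model and shapes are purely illustrative:

```python
import torch
import torch.nn as nn

# Toy model/optimizer standing in for Tacotron 2 and its optimizer.
model = nn.Linear(80, 80).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(16, 80, device="cuda")
target = torch.randn(16, 80, device="cuda")

for step in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # forward pass runs in fp16 where safe
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)            # unscale grads, then take the step
    scaler.update()
```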
Here is the output of PyTorch's bottleneck profiler (`python -m torch.utils.bottleneck`). The time all seems to be spent in backpropagation, so maybe some hardware bottleneck exists.
PyTorch Bottleneck Profiling
--------------------------------------------------------------------------------
Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.6.0 compiled w/ CUDA 10.2
Running with Python 3.7 and CUDA 10.2.89
`pip list` truncated output:
numpy==1.18.1
torch==1.6.0
torchvision==0.6.1
--------------------------------------------------------------------------------
cProfile output
--------------------------------------------------------------------------------
3654582 function calls (3478563 primitive calls) in 21.693 seconds
Ordered by: internal time
List reduced from 15174 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
1 8.180 8.180 8.180 8.180 {method 'run_backward' of 'torch._C._EngineBase' objects}
89 3.847 0.043 3.847 0.043 {method 'cuda' of 'torch._C._TensorBase' objects}
3441 1.389 0.000 1.389 0.000 {built-in method cat}
1720 0.766 0.000 0.766 0.000 {built-in method lstm_cell}
860 0.704 0.001 1.696 0.002 /home/jovyan/work/tacotron2/model.py:43(get_alignment_energies)
2583 0.639 0.000 0.639 0.000 {method 'matmul' of 'torch._C._TensorBase' objects}
1720 0.598 0.000 0.598 0.000 {built-in method addmm}
2 0.487 0.244 0.487 0.244 /opt/conda/lib/python3.7/site-packages/numpy/linalg/linalg.py:1458(svd)
3794 0.352 0.000 0.352 0.000 {built-in method marshal.loads}
2 0.264 0.132 0.264 0.132 {method 'poll' of 'select.poll' objects}
1730 0.190 0.000 0.190 0.000 {built-in method dropout}
61 0.160 0.003 0.160 0.003 {method 'uniform_' of 'torch._C._TensorBase' objects}
19579 0.152 0.000 0.152 0.000 {built-in method posix.stat}
1 0.150 0.150 0.150 0.150 {built-in method lstm}
868 0.150 0.000 0.150 0.000 {built-in method conv1d}
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
---------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
---------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------------------------------------
CudnnRnnBackward 18.07% 156.983ms 18.07% 156.983ms 156.983ms NaN 0.000us 0.000us 1 []
_cudnn_rnn_backward 18.07% 156.956ms 18.07% 156.956ms 156.956ms NaN 0.000us 0.000us 1 []
lstm 15.69% 136.274ms 15.69% 136.274ms 136.274ms NaN 0.000us 0.000us 1 []
_cudnn_rnn 15.68% 136.226ms 15.68% 136.226ms 136.226ms NaN 0.000us 0.000us 1 []
t 4.68% 40.689ms 4.68% 40.689ms 40.689ms NaN 0.000us 0.000us 1 []
AddBackward0 4.64% 40.334ms 4.64% 40.334ms 40.334ms NaN 0.000us 0.000us 1 []
uniform_ 3.94% 34.248ms 3.94% 34.248ms 34.248ms NaN 0.000us 0.000us 1 []
add 2.64% 22.934ms 2.64% 22.934ms 22.934ms NaN 0.000us 0.000us 1 []
MmBackward 2.63% 22.861ms 2.63% 22.861ms 22.861ms NaN 0.000us 0.000us 1 []
t 2.60% 22.556ms 2.60% 22.556ms 22.556ms NaN 0.000us 0.000us 1 []
transpose 2.60% 22.547ms 2.60% 22.547ms 22.547ms NaN 0.000us 0.000us 1 []
as_strided 2.56% 22.256ms 2.56% 22.256ms 22.256ms NaN 0.000us 0.000us 1 []
uniform_ 2.19% 19.002ms 2.19% 19.002ms 19.002ms NaN 0.000us 0.000us 1 []
PackPaddedSequenceBackward 2.01% 17.458ms 2.01% 17.458ms 17.458ms NaN 0.000us 0.000us 1 []
_pack_padded_sequence_backward 2.01% 17.452ms 2.01% 17.452ms 17.452ms NaN 0.000us 0.000us 1 []
---------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------------------------------------
Self CPU time total: 868.775ms
CUDA time total: 0.000us
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
---------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
---------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------------------------------------
stack 15.01% 2.345s 15.01% 2.345s 2.345s 15.01% 2.345s 2.345s 1 []
stack 13.53% 2.113s 13.53% 2.113s 2.113s 13.53% 2.113s 2.113s 1 []
stack 12.94% 2.020s 12.94% 2.020s 2.020s 12.93% 2.020s 2.020s 1 []
cat 8.11% 1.266s 8.11% 1.266s 1.266s 8.11% 1.266s 1.266s 1 []
_cat 8.10% 1.266s 8.10% 1.266s 1.266s 8.11% 1.266s 1.266s 1 []
cat 8.03% 1.254s 8.03% 1.254s 1.254s 8.03% 1.254s 1.254s 1 []
_cat 8.03% 1.254s 8.03% 1.254s 1.254s 8.03% 1.254s 1.254s 1 []
cat 5.42% 846.588ms 5.42% 846.588ms 846.588ms 5.42% 846.696ms 846.696ms 1 []
_cat 5.42% 846.410ms 5.42% 846.410ms 846.410ms 5.42% 846.608ms 846.608ms 1 []
StackBackward 2.76% 430.782ms 2.76% 430.782ms 430.782ms 2.76% 430.784ms 430.784ms 1 []
unbind 2.75% 429.524ms 2.75% 429.524ms 429.524ms 2.75% 429.512ms 429.512ms 1 []
StackBackward 2.70% 422.450ms 2.70% 422.450ms 422.450ms 2.70% 422.448ms 422.448ms 1 []
unbind 2.70% 421.451ms 2.70% 421.451ms 421.451ms 2.70% 421.440ms 421.440ms 1 []
PackPaddedSequenceBackward 2.25% 352.125ms 2.25% 352.125ms 352.125ms 2.25% 352.112ms 352.112ms 1 []
_pack_padded_sequence_backward 2.25% 351.999ms 2.25% 351.999ms 351.999ms 2.25% 352.000ms 352.000ms 1 []
---------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------------------------------------
Self CPU time total: 15.619s
CUDA time total: 15.620s
This seems to imply that the backward calls are the bulk of it.
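If anyone wants to reproduce the backward-pass measurement outside the profiler, this is roughly how I'd time it (a minimal sketch; the toy LSTM is a stand-in for the recurrent decoder, and the `torch.cuda.synchronize()` calls matter because CUDA kernels launch asynchronously):

```python
import time
import torch
import torch.nn as nn

# Toy LSTM standing in for the recurrent decoder; the point is the
# synchronize-then-time pattern, not the model itself.
model = nn.LSTM(256, 1024, batch_first=True).cuda()
x = torch.randn(32, 200, 256, device="cuda")

out, _ = model(x)
loss = out.sum()

torch.cuda.synchronize()   # wait for the forward kernels to finish
t0 = time.perf_counter()
loss.backward()
torch.cuda.synchronize()   # wait for the backward kernels too
print(f"backward: {time.perf_counter() - t0:.3f}s")
```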