Hi,
I came across two cases where the metric value returned by Nsight does not make sense to me.
Case I: The value of smsp__inst_executed_pipe_tensor.sum is greater than smsp__inst_executed.sum for the same kernel.
I was profiling one mixed-precision training iteration of ResNet-50, using the implementation from: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/RN50v1.5
The command I used to run the job:
python main.py --arch resnet50 -c fanin --label-smoothing 0.1 -b 40 --fp16 --static-loss-scale 256 --training-only --epochs 1 /dataset/
I inserted torch.cuda.profiler.start()/stop() to profile one iteration of the training process (one pass over a batch), i.e., one iteration of the for loop in image_classification/training.py:train. I believe this issue is independent of which iteration is profiled.
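For reference, a simplified sketch of how I scoped the capture (the function name and the loader/step stand-ins are mine for illustration; in the actual script the calls are torch.cuda.profiler.start()/stop() inside the real training loop):

```python
def train_with_profiled_iteration(batches, step, profiler_start, profiler_stop,
                                  profile_iter=0):
    """Run the training loop, wrapping exactly one iteration in
    profiler start/stop calls (torch.cuda.profiler.start()/stop()
    in the actual script)."""
    for i, batch in enumerate(batches):
        if i == profile_iter:
            profiler_start()  # begin the capture range
        step(batch)           # forward + backward + optimizer step
        if i == profile_iter:
            profiler_stop()   # end the capture range after one batch
```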
Two kernels (dgrad_engine and wgrad_alg0_engine) show this behaviour; however, not all instances of these kernels within the iteration are affected. Here’s an example where the value does not make sense:
"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
"320","582","python3.6","127.0.0.1","dgrad_engine","2019-Jun-04 17:34:14","1","7","Instruction count for different pipelines","EXEC","inst","86834903"
"320","582","python3.6","127.0.0.1","dgrad_engine","2019-Jun-04 17:34:14","1","7","Instruction count for different pipelines","Tensor","inst","18446744073709551582"
The instruction count for Tensor core looks like a bogus value to me.
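Note that the suspicious value is exactly 34 below 2^64, so my guess is that it is a small negative count that wrapped around in an unsigned 64-bit field (a quick check in Python):

```python
# The reported Tensor count is exactly 2**64 - 34, i.e. it looks like
# -34 stored in an unsigned 64-bit counter.
bogus = 18446744073709551582
assert bogus == 2**64 - 34
print(bogus - 2**64)  # -> -34, the value reinterpreted as signed
```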
Other than smsp__inst_executed.sum and smsp__inst_executed_pipe_tensor.sum, my section file for Nsight Compute also includes metrics for the alu, fma, fp16, fp64, spu, and lsu pipelines, but the values of those metrics look OK to me.
Case II: lts__t_sector_hit_rate.pct returns a value greater than 100%.
I found this issue by profiling the same application as in Case I, but with FP32 training. The command used to run the job:
python main.py --arch resnet50 -c fanin --label-smoothing 0.1 -b 40 --training-only --epochs 1 /dataset/
Here’s an example of the problematic profiled output:
"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value"
"0","1264","python3.6","127.0.0.1","implicit_convolve_sgemm","2019-Jun-06 01:23:06","1","7","Memory utilization related data","L2 Hit Rate","%","103.80"
How can the L2 hit rate (lts__t_sector_hit_rate.pct) go above 100%?
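My understanding (an assumption on my part, not taken from the metric documentation) is that the .pct value is simply hit sectors over total sectors times 100, so a value above 100% would mean the hit counter exceeded the total-sector counter:

```python
def sector_hit_rate_pct(hit_sectors, total_sectors):
    # Assumed formula for lts__t_sector_hit_rate.pct:
    # percentage of L2 sector accesses that hit.
    return 100.0 * hit_sectors / total_sectors

# A rate of 103.80% would require ~1038 hits per 1000 sector accesses,
# which should be impossible if both counters cover the same accesses.
print(sector_hit_rate_pct(1038, 1000))
```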
Common system setup:
GPU: RTX 2070 (single GPU training)
NVIDIA driver: 430.14
Docker image: nvcr.io/nvidia/pytorch:19.05-py3
Nsight Compute version: NsightCompute-2019.3
Any explanation would be greatly appreciated. Thanks!