Some metric values don't make sense

Hi,

I've come across two cases where a metric value reported by Nsight Compute does not make sense to me.

Case I: Value of smsp__inst_executed_pipe_tensor.sum is greater than smsp__inst_executed.sum for the same kernel.

I was profiling one iteration of ResNet-50 mixed-precision training using the implementation from: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/RN50v1.5

The command I used to run the job:

python main.py --arch resnet50 -c fanin --label-smoothing 0.1 -b 40 --fp16 --static-loss-scale 256 --training-only --epochs 1 /dataset/

I inserted torch.cuda.profiler.start()/stop() to profile a single iteration of the training process (one batch), i.e. one pass through the body of the for loop in image_classification/training.py:train. I believe the issue is independent of which iteration is profiled.
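
For reference, this is roughly how I scoped the collection (just a sketch; PROFILED_ITERATION, data_loader and step stand in for the corresponding pieces of the training script, they are not the exact upstream code):

import torch

PROFILED_ITERATION = 3  # skip a few warm-up iterations first

# Start CUDA profiling right before one batch is processed and stop right
# after it, so Nsight Compute only captures the kernels of that iteration.
for i, (input, target) in enumerate(data_loader):
    if i == PROFILED_ITERATION:
        torch.cuda.profiler.start()
    loss = step(input, target)  # forward + backward + optimizer step
    if i == PROFILED_ITERATION:
        torch.cuda.profiler.stop()
        break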

Two kernels (dgrad_engine and wgrad_alg0_engine) show this behaviour, but not every instance of these kernels within the iteration is affected. Here’s an example where the value does not make sense:

"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value" 

"320","582","python3.6","127.0.0.1","dgrad_engine","2019-Jun-04 17:34:14","1","7","Instruction count for different pipelines","EXEC","inst","86834903" 

"320","582","python3.6","127.0.0.1","dgrad_engine","2019-Jun-04 17:34:14","1","7","Instruction count for different pipelines","Tensor","inst","18446744073709551582"

The instruction count for the Tensor pipeline looks like a bogus value to me.
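
What makes me suspicious is how close that number is to 2^64. A quick check (plain arithmetic, nothing Nsight-specific):

raw = 18446744073709551582

# 2**64 - raw == 34, so reinterpreted as a signed 64-bit integer the value
# is -34. That looks like an unsigned wrap-around rather than a real count.
print(2**64 - raw)  # 34
print(raw - 2**64)  # -34, the two's-complement interpretation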

Besides smsp__inst_executed.sum and smsp__inst_executed_pipe_tensor.sum, my custom Nsight Compute section file also collects the per-pipeline instruction counts for alu, fma, fp16, fp64, spu and lsu, but the values of those other metrics look fine to me.

Case II: lts__t_sector_hit_rate.pct returns a value greater than 100%.

I found this issue while profiling the same application as in Case I, but with FP32 training only. The command I used to run the job:

python main.py --arch resnet50 -c fanin --label-smoothing 0.1 -b 40 --training-only --epochs 1 /dataset/

Here’s an example of the problematic profiled output:

"ID","Process ID","Process Name","Host Name","Kernel Name","Kernel Time","Context","Stream","Section Name","Metric Name","Metric Unit","Metric Value" 

"0","1264","python3.6","127.0.0.1","implicit_convolve_sgemm","2019-Jun-06 01:23:06","1","7","Memory utilization related data","L2 Hit Rate","%","103.80"

How can the L2 hit rate (lts__t_sector_hit_rate.pct) exceed 100%?
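
In case it helps, here is a quick sanity-check sketch for flagging such rows in the exported CSV (the column names match the headers shown above; the thresholds are just my own heuristics, not anything Nsight-defined):

import csv
import sys

# Scan an Nsight Compute CSV export (same column layout as the snippets above)
# and print rows whose metric values look implausible.
def flag_suspicious(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                value = float(row["Metric Value"])
            except ValueError:
                continue
            kernel, metric = row["Kernel Name"], row["Metric Name"]
            # A percentage metric such as lts__t_sector_hit_rate.pct
            # should never exceed 100.
            if row["Metric Unit"] == "%" and value > 100.0:
                print(f"{kernel}: {metric} = {value}")
            # Instruction counts in the upper half of the unsigned 64-bit
            # range look like wrapped negative values, not real counts.
            if row["Metric Unit"] == "inst" and value >= 2**63:
                print(f"{kernel}: {metric} = {row['Metric Value']}")

if __name__ == "__main__":
    flag_suspicious(sys.argv[1])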

Common system setup:
GPU: RTX 2070 (single GPU training)
NVIDIA driver: 430.14
Docker image: nvcr.io/nvidia/pytorch:19.05-py3
Nsight Compute version: NsightCompute-2019.3

Any explanation would be greatly appreciated. Thanks!