Constant GPU inference power


I am trying to build a power profile for each Jetson Nano component independently. For most components the instantaneous power varies with the load (e.g. CPU power consumption depends on its % usage).

When modeling the GPU, I found that the actual power consumption (reported by the TegraStats utility) during inference is constant at the maximum. For inference I use the TensorRT framework. I tried both a full model 120 layers deep and a single-layer model, and the reported power is the same. The energy changes depending on the inference performance, but one would not expect such different loads to need the same number of cores; I may be wrong, though. Is this behavior expected?

Thank you in advance

Best regards

Probably the loading is heavy in both cases. Please try this sample and check if you see a difference:


Jetson Linux API Reference: 02_video_dec_cuda
The sample contains simple CUDA code that generates little GPU loading. Please run it as a comparison.


Thanks for the response.

You were right; this encoding pipeline only partially occupies the GPU. Can I then assume that inference with any model size through the TRT framework will occupy 100% of the GPU? By the way, I found this behavior on three boards: Nano, AGX and NX.

For a single-layer model, needing the whole GPU might make sense on the Nano board, but on the AGX it seems like overkill.


You may check whether it is specific to the model. We have a ResNet10 model in the DeepStream SDK. You can try this command on Jetson platforms for comparison:

/opt/nvidia/deepstream/deepstream-5.1/samples/configs/deepstream-app$ deepstream-app -c source8_1080p_dec_infer-resnet_tracker_tiled_display_fp16_nano.txt


You are right. I can now see non-static GPU usage.

Do you have any idea why it may be using the full GPU with a single-ReLU-layer model? The input is 224x224x3. The same happens with a MobileNetV1 model with the same input.

Also, I am reusing the Python TensorRT framework examples provided as a base.



Do you use the pure TensorRT Python API, or the version integrated in TensorFlow or PyTorch?

With the pure TensorRT API, all inference jobs are submitted through the enqueue function.
If the queue is never empty, the GPU keeps inferencing and the loading stays at its maximum.
But if you submit input data periodically, e.g. every 33 ms, the GPU load will decrease during the periods when the queue is empty.
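The effect of pacing can be seen with a minimal Python sketch. This does not use TensorRT itself; `run_inference` is a hypothetical stand-in for an enqueue/execute call, and the 5 ms inference time and 33 ms period are assumed values for illustration:

```python
import time

def run_inference():
    # Hypothetical stand-in for a TensorRT enqueue/execute call;
    # here we simply simulate a 5 ms inference.
    time.sleep(0.005)

PERIOD_S = 0.033  # submit one input every 33 ms (e.g. 30 fps video)

busy = 0.0
start = time.monotonic()
for _ in range(10):
    t0 = time.monotonic()
    run_inference()
    busy += time.monotonic() - t0
    # Wait until the next 33 ms slot; during this wait the queue is
    # empty and the GPU would be idle, so its load drops.
    elapsed = time.monotonic() - start
    next_slot = (elapsed // PERIOD_S + 1) * PERIOD_S
    time.sleep(next_slot - elapsed)

total = time.monotonic() - start
print(f"duty cycle: {busy / total:.0%}")  # roughly 5/33, i.e. ~15%
```

If the pacing sleep is removed, the loop submits back-to-back and the duty cycle goes to ~100%, which matches the constant-maximum load reported by TegraStats.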



I use TensorRT python API.

I will try it with the queue system.

On the other hand, I have a question regarding GPU core usage: is there a way of knowing/checking which cores are in use at any moment in time? I am assuming that TensorRT is intended to use the Tensor Cores due to their low power consumption and high performance, but I may be wrong.

Thanks for your help.


This is decided by the GPU scheduler.
If a Tensor Core is idle and the task is supported, it is expected to use the Tensor Core.

You can use a profiler to get the usage information.
Please check the command below for more information: