I am not targeting a specific data type in my roofline analysis. My goal is to use Nsight Compute metrics to classify kernels as either compute-intensive or memory-intensive. After classification, I plan to profile them while assigning different numbers of SMs (streaming multiprocessors). By calculating the approximate AI (arithmetic intensity) and comparing it with the ridge point of the system, I aim to estimate the minimum number of SMs needed for each memory-intensive kernel.
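To make the plan concrete, here is a minimal sketch of the ridge-point comparison in Python. The peak numbers, the function names, and especially the SM-scaling model are assumptions for illustration: it assumes the compute roof scales linearly with the number of SMs while memory bandwidth stays fixed, which may not hold on real hardware.

```python
import math

def ridge_point(peak_flops_per_s, peak_bytes_per_s):
    """AI (FLOP/byte) at which the compute roof meets the memory roof."""
    return peak_flops_per_s / peak_bytes_per_s

def classify(ai, ridge):
    """Kernels left of the ridge point are memory-intensive."""
    return "compute-intensive" if ai >= ridge else "memory-intensive"

def min_sms(ai, peak_flops_per_s, peak_bytes_per_s, total_sms):
    """Smallest SM count whose scaled compute roof still covers ai * bandwidth.

    ASSUMPTION: compute throughput scales linearly with SM count and
    memory bandwidth does not depend on it.
    """
    per_sm_flops = peak_flops_per_s / total_sms
    needed = math.ceil(ai * peak_bytes_per_s / per_sm_flops)
    return max(1, min(total_sms, needed))

# Hypothetical device: 8 TFLOP/s peak, 800 GB/s DRAM bandwidth, 80 SMs.
PEAK_FLOPS, PEAK_BW, TOTAL_SMS = 8.0e12, 8.0e11, 80

ridge = ridge_point(PEAK_FLOPS, PEAK_BW)
kernel_ai = 0.5  # hypothetical measured AI in FLOP/byte
print(ridge, classify(kernel_ai, ridge),
      min_sms(kernel_ai, PEAK_FLOPS, PEAK_BW, TOTAL_SMS))
```

Under these assumptions, a kernel is memory-intensive when its AI is below the ridge point, and the SM estimate is the point where the reduced compute roof drops to the kernel's bandwidth-limited throughput.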
To achieve this, I need the achieved AI for different kernel types. However, I am having trouble with compute-intensive kernels: my metrics do not give valid AI values, even when I collect the full set of metrics in NVIDIA Nsight Compute. Currently, I use the following metrics to calculate the AI of a kernel:
Time
metrics="sm__cycles_elapsed.avg,\
sm__cycles_elapsed.avg.per_second,"
DP
metrics+="sm__sass_thread_inst_executed_op_dadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_dfma_pred_on.sum,\
sm__sass_thread_inst_executed_op_dmul_pred_on.sum,"
SP
metrics+="sm__sass_thread_inst_executed_op_fadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_ffma_pred_on.sum,\
sm__sass_thread_inst_executed_op_fmul_pred_on.sum,"
HP
metrics+="sm__sass_thread_inst_executed_op_hadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_hfma_pred_on.sum,\
sm__sass_thread_inst_executed_op_hmul_pred_on.sum,"
Tensor Core
metrics+="sm__inst_executed_pipe_tensor.sum,"
DRAM, L2 and L1
metrics+="dram__bytes.sum,\
lts__t_bytes.sum,\
l1tex__t_bytes.sum"
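For reference, this is a sketch of how I combine these counters into an AI value, written in Python. It assumes the metric values have already been parsed into a dictionary keyed by metric name. The helper names and sample numbers are hypothetical, and the FLOPs-per-tensor-instruction factor is architecture dependent, so it is left as a parameter rather than hard-coded.

```python
INST = "sm__sass_thread_inst_executed_op_{}{}_pred_on.sum"

def flops(m, prefix):
    """FLOPs for one precision ('d', 'f' or 'h'); an FMA counts as 2 FLOPs."""
    add = m.get(INST.format(prefix, "add"), 0.0)
    mul = m.get(INST.format(prefix, "mul"), 0.0)
    fma = m.get(INST.format(prefix, "fma"), 0.0)
    return add + mul + 2.0 * fma

def kernel_time_s(m):
    """Elapsed time in seconds: cycles divided by cycles per second."""
    return m["sm__cycles_elapsed.avg"] / m["sm__cycles_elapsed.avg.per_second"]

def arithmetic_intensity(m, bytes_metric="dram__bytes.sum",
                         tensor_flops_per_inst=0.0):
    """FLOPs per byte moved at the chosen memory level.

    ASSUMPTION: tensor_flops_per_inst must be supplied per architecture;
    with the default of 0.0, tensor-core work is ignored.
    """
    total = sum(flops(m, p) for p in "dfh")
    total += tensor_flops_per_inst * m.get("sm__inst_executed_pipe_tensor.sum", 0.0)
    nbytes = m[bytes_metric]
    return float("inf") if nbytes == 0 else total / nbytes

# Hypothetical values for one kernel launch.
sample = {
    "sm__cycles_elapsed.avg": 1.0e6,
    "sm__cycles_elapsed.avg.per_second": 1.0e9,
    "sm__sass_thread_inst_executed_op_ffma_pred_on.sum": 1.0e9,
    "dram__bytes.sum": 4.0e9,
    "lts__t_bytes.sum": 8.0e9,
}

ai_dram = arithmetic_intensity(sample)                   # DRAM-level AI
ai_l2 = arithmetic_intensity(sample, "lts__t_bytes.sum")  # L2-level AI
gflops = sum(flops(sample, p) for p in "dfh") / kernel_time_s(sample) / 1e9
print(ai_dram, ai_l2, gflops)
```

Note that these instruction counters only cover floating-point add, mul, and fma, so a kernel dominated by integer or memory instructions yields an AI close to zero under this definition.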
Are there other metrics I can use for programs such as histogram, breadth-first search, and matrix transpose to calculate an approximate AI and classify them as memory-intensive or compute-intensive?