For completeness: if you really want the raw total, you don't need the .per_cycle_elapsed sub-metric at all. The .sum sub-metric, smsp__sass_thread_inst_executed_op_dadd_pred_on.sum, already reports it, and no further calculation is necessary.
Similarly, instead of computing smsp__cycles_elapsed.avg.per_second * kernel_duration, you can read smsp__cycles_elapsed.max (or, better, gpc__cycles_elapsed.max) directly. This gives the cycle count of the longest-executing unit instance, which in turn determines the cycle count of the entire kernel.
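To make the arithmetic concrete, here is a minimal Python sketch of the manual route that the two direct metrics replace. Every number in it is a made-up placeholder, not a measurement, and the variable and function names are my own. It also assumes the kernel duration is given in nanoseconds; check the unit your report actually shows.

```python
# Sketch of the manual calculation that the .sum and .max metrics make unnecessary.
# All values are hypothetical placeholders, not profiler output.

def cycles_from_rate(cycles_per_second, duration_ns):
    """Estimate elapsed cycles as clock rate times kernel duration."""
    return cycles_per_second * duration_ns / 1e9

def total_from_per_cycle(inst_per_cycle, elapsed_cycles):
    """Recover a total count from a per-elapsed-cycle rate."""
    return inst_per_cycle * elapsed_cycles

# Hypothetical stand-ins for profiler readings:
rate_hz = 1_000_000_000   # stands in for smsp__cycles_elapsed.avg.per_second
duration_ns = 2_000_000   # stands in for the kernel duration (assumed in ns)
inst_per_cycle = 0.5      # stands in for ...dadd_pred_on.sum.per_cycle_elapsed

elapsed_cycles = cycles_from_rate(rate_hz, duration_ns)
total_inst = total_from_per_cycle(inst_per_cycle, elapsed_cycles)

print("estimated elapsed cycles:", elapsed_cycles)
print("estimated total DADD thread instructions:", total_inst)
```

With the direct metrics, total_inst corresponds to reading the .sum value and elapsed_cycles to reading the .max value. Expect small differences between the two routes on real data, since the manual route starts from an average clock rate rather than the longest-running instance.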