Suppose I know the number of predicated threads, for example, that M threads will run in kernel X.
Is it possible to count the floating-point operations (FLOPs) of kernel X by inspecting the kernel's assembly code and then multiplying the per-thread count by M?
In other words, I want to estimate the FLOPs of kernel X for a given value of M.
How can I get the assembly code of a CUDA kernel?
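
To make the estimate I have in mind concrete, here is a rough Python sketch. It assumes the kernel's assembly is already available as text; the listing below is a made-up, SASS-like sample, not real compiler output. It also assumes that FADD and FMUL count as 1 FLOP and FFMA as 2, and that each of the M threads executes every instruction exactly once (no loops, no divergent branches). The names FLOP_WEIGHTS, flops_per_thread and estimate_flops are hypothetical.

```python
import re

# Assumed FLOP weights per single-precision opcode (an assumption, not an
# exhaustive list): a fused multiply-add is counted as two operations.
FLOP_WEIGHTS = {"FADD": 1, "FMUL": 1, "FFMA": 2}

# Made-up SASS-like listing for illustration only; not real compiler output.
SAMPLE_ASM = """\
/*0000*/  MOV R1, c[0x0][0x28] ;
/*0010*/  FMUL R4, R2, R3 ;
/*0020*/  FFMA R5, R2, R3, R4 ;
/*0030*/  @P0 FADD R6, R5, R4 ;
/*0040*/  FADD.FTZ R7, R6, R5 ;
/*0050*/  EXIT ;
"""


def flops_per_thread(asm_text):
    """Sum the assumed FLOP weight of every instruction in the listing."""
    total = 0
    for line in asm_text.splitlines():
        # Drop /* ... */ address comments.
        tokens = re.sub(r"/\*.*?\*/", "", line).split()
        # Skip a leading predicate guard such as @P0 or @!P0. The guarded
        # instruction is still counted, so the result is an upper bound.
        if tokens and tokens[0].startswith("@"):
            tokens = tokens[1:]
        if not tokens:
            continue
        # Strip modifiers such as .FTZ to get the base opcode.
        opcode = tokens[0].split(".")[0]
        total += FLOP_WEIGHTS.get(opcode, 0)
    return total


def estimate_flops(asm_text, m):
    """Static estimate: per-thread FLOPs multiplied by M active threads."""
    return flops_per_thread(asm_text) * m


if __name__ == "__main__":
    print(flops_per_thread(SAMPLE_ASM))      # 5
    print(estimate_flops(SAMPLE_ASM, 1024))  # 5120
```

This is only a static count: anything inside a loop or behind a branch would need its opcode count scaled by how many times it actually runs, which the listing alone does not show.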