How to find the latency of the PTX hsub2 and vsub4 instructions

I want to know the latency of the sub.f16x2 and vsub4 PTX instructions.
Is there any document that describes these values, or is there a method to measure instruction latency?

As the generated SASS (machine code) shows, vsub4.u32.u32.u32 is emulated using four LOP3.LUT instructions and one IADD. You can check how sub.f16x2 is implemented in the same way, by inspecting the output of cuobjdump --dump-sass.
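For anyone who wants to reproduce the SASS inspection, here is a minimal sketch of a file that can be compiled and disassembled. The file name, the kernel names and the wrapper name ptx_vsub4 are my own placeholders, and I am assuming that the __hsub2 intrinsic from cuda_fp16.h maps to sub.f16x2 on your toolkit, which is worth confirming in the generated PTX.

```cuda
// vsub4_sass.cu (hypothetical file name): kernels that exist only to be disassembled.
#include <cuda_fp16.h>

// Inline PTX wrapper (name is a placeholder). The PTX syntax for vsub4 requires
// a fourth source operand c; it is simply passed as 0 here.
__device__ __forceinline__ unsigned int ptx_vsub4(unsigned int a, unsigned int b)
{
    unsigned int r, c = 0;
    asm("vsub4.u32.u32.u32 %0, %1, %2, %3;" : "=r"(r) : "r"(a), "r"(b), "r"(c));
    return r;
}

// Byte-wise subtraction of packed 8-bit lanes.
// No bounds check: this kernel is meant for SASS inspection, not for production use.
extern "C" __global__ void k_vsub4(const unsigned int* a, const unsigned int* b,
                                   unsigned int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = ptx_vsub4(a[i], b[i]);
}

// Packed half-precision subtraction via the __hsub2 intrinsic (needs sm_53 or newer).
extern "C" __global__ void k_hsub2(const __half2* a, const __half2* b, __half2* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = __hsub2(a[i], b[i]);
}
```

Compile to a cubin with something like nvcc -arch=sm_80 -cubin -o vsub4_sass.cubin vsub4_sass.cu (substitute the architecture of your own GPU for sm_80), then run cuobjdump --dump-sass vsub4_sass.cubin and look at the body of each kernel. The instruction sequence you see can differ between architectures and compiler versions.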

Do you need the actual latency or the throughput? GPUs are throughput-oriented architectures, so one usually only needs the latter. Since vsub4 expands to five simple instructions, it has roughly 1/5 the throughput of a simple integer instruction. The CUDA Programming Guide lists the throughput of the various instruction classes, which depends on the GPU architecture.
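If you do need the latency itself, one common approach is to time a long chain of dependent instructions with clock64() in a single thread. The following is only a sketch under my own assumptions: the names (chain_kernel, ptx_vsub4, cycles_per_op), the chain length and the unroll factor are arbitrary choices, and the result includes some loop and timer overhead, so treat it as an estimate. Because vsub4 is emulated, what gets measured is the latency of the whole emulation sequence, not of a single hardware instruction.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

constexpr int kIters = 4096;  // length of the dependent chain (arbitrary choice)

#define CHECK(call)                                                             \
    do {                                                                        \
        cudaError_t err_ = (call);                                              \
        if (err_ != cudaSuccess) {                                              \
            std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            std::exit(1);                                                       \
        }                                                                       \
    } while (0)

// Inline PTX wrapper (placeholder name). 'volatile' keeps the compiler from
// deleting the asm statements or reordering them among themselves.
__device__ __forceinline__ unsigned int ptx_vsub4(unsigned int a, unsigned int b)
{
    unsigned int r, c = 0;
    asm volatile("vsub4.u32.u32.u32 %0, %1, %2, %3;" : "=r"(r) : "r"(a), "r"(b), "r"(c));
    return r;
}

// One thread runs a chain in which every vsub4 consumes the previous result,
// so the operations cannot overlap and the elapsed cycles reflect latency.
__global__ void chain_kernel(unsigned int seed, unsigned int step,
                             unsigned int* out, long long* cycles)
{
    unsigned int x = seed;
    long long t0 = clock64();
#pragma unroll 16
    for (int i = 0; i < kIters; ++i)
        x = ptx_vsub4(x, step);
    long long t1 = clock64();
    *out = x;            // store the result so the chain has an observable use
    *cycles = t1 - t0;
}

// Average cycles per operation in the chain.
double cycles_per_op(long long total_cycles, int ops)
{
    return static_cast<double>(total_cycles) / ops;
}

int main()
{
    unsigned int* d_out = nullptr;
    long long* d_cycles = nullptr;
    CHECK(cudaMalloc(&d_out, sizeof(unsigned int)));
    CHECK(cudaMalloc(&d_cycles, sizeof(long long)));

    // Operands are passed at run time so nothing can be folded at compile time.
    chain_kernel<<<1, 1>>>(0x80808080u, 0x01010101u, d_out, d_cycles);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    unsigned int h_out = 0;
    long long h_cycles = 0;
    CHECK(cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(&h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost));

    std::printf("result 0x%08x, about %.2f cycles per vsub4\n",
                h_out, cycles_per_op(h_cycles, kIters));

    CHECK(cudaFree(d_out));
    CHECK(cudaFree(d_cycles));
    return 0;
}
```

It is a good idea to dump the SASS of this program as well and confirm that the unrolled chain really sits between the two clock reads. To estimate throughput instead, you would run many independent chains across enough threads to fill the GPU, which this sketch does not attempt.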
