I want to know the latency of the sub.fp16x2 and vsub4 PTX instructions.
Is there any document that describes these values, or is there a method to measure instruction latency?
As can be seen from the generated SASS (machine code), vsub4.u32.u32.u32 is emulated using four LOP3.LUT and one IADD. You can likewise check the implementation of sub.fp16x2 by inspecting the output of cuobjdump --dump-sass.
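To make the "four logic operations plus one integer add/subtract" shape concrete, here is a minimal sketch of a well-known bit-manipulation way to do a per-byte subtraction on a 32-bit word. It happens to need the same operation count, but that the compiler emits exactly this sequence is an assumption; the SASS dump is the authority. The names swar_sub4 and ref_sub4 are hypothetical helpers, not part of CUDA.

```cuda
#include <cassert>
#include <cstdio>

// Hypothetical helper: per-byte a - b with wrap-around, using only
// word-wide logic operations and a single integer subtraction.
__host__ __device__ inline unsigned int swar_sub4(unsigned int a, unsigned int b)
{
    const unsigned int H = 0x80808080u;  // top bit of every byte
    // Set the top bit of every byte of a and clear it in b, so that no
    // borrow can propagate from one byte into the next.
    unsigned int d = (a | H) - (b & ~H);
    // Repair the top bit of every byte: d_top ^ a_top ^ ~b_top.
    return d ^ ((a ^ ~b) & H);
}

// Byte-at-a-time reference, used only to check the sketch above.
static unsigned int ref_sub4(unsigned int a, unsigned int b)
{
    unsigned int r = 0;
    for (int i = 0; i < 32; i += 8) {
        unsigned int byte = ((a >> i) - (b >> i)) & 0xFFu;
        r |= byte << i;
    }
    return r;
}

int main()
{
    const unsigned int samples[] = {0x00000000u, 0x01010101u, 0x40302010u,
                                    0x80FF7F00u, 0xFFFFFFFFu};
    for (unsigned int a : samples) {
        for (unsigned int b : samples) {
            assert(swar_sub4(a, b) == ref_sub4(a, b));
        }
    }
    printf("swar_sub4 matches the byte-wise reference\n");
    return 0;
}
```

Counting the operations: an OR, an AND-with-complement, one subtraction, a combined XOR/AND (which a three-input logic instruction can do in one step), and a final XOR.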
Do you need the actual latency or the throughput? GPUs are throughput-oriented architectures, so one usually only needs the latter: because vsub4 expands to five instructions, it has roughly 1/5 the throughput of simple integer instructions. The throughput of the various instruction classes is listed in the CUDA Programming Guide and depends on the GPU architecture.
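If the actual latency is what you need, a common approach is to time a long chain of dependent operations with the clock64() device function and divide by the chain length. The following is a minimal sketch under several assumptions: it uses the __vsub4 intrinsic as a stand-in for the PTX instruction, the names time_vsub4_chain, measure_chain and kChainLength are hypothetical, and the number it prints is cycles per __vsub4 call (the whole emulated sequence), not per SASS instruction.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed chain length. It is a multiple of 256 so that every byte wraps
// back to its starting value, which gives a simple sanity check.
constexpr int kChainLength = 1024;

// One thread runs a chain in which each __vsub4 depends on the previous one.
__global__ void time_vsub4_chain(unsigned int seed, unsigned int step,
                                 unsigned int *result, long long *cycles)
{
    unsigned int x = seed;
    long long start = clock64();
#pragma unroll
    for (int i = 0; i < kChainLength; ++i) {
        x = __vsub4(x, step);
    }
    long long stop = clock64();
    *result = x;              // store the value so the chain is not removed
    *cycles = stop - start;
}

// Hypothetical host wrapper: returns elapsed cycles, or -1 on a CUDA error.
long long measure_chain(unsigned int seed, unsigned int step, unsigned int *result)
{
    unsigned int *d_result = nullptr;
    long long *d_cycles = nullptr;
    long long cycles = -1;
    if (cudaMalloc(&d_result, sizeof(*d_result)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_cycles, sizeof(*d_cycles)) != cudaSuccess) {
        cudaFree(d_result);
        return -1;
    }
    time_vsub4_chain<<<1, 1>>>(seed, step, d_result, d_cycles);
    cudaError_t err = cudaMemcpy(result, d_result, sizeof(*result),
                                 cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) {
        err = cudaMemcpy(&cycles, d_cycles, sizeof(cycles),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_result);
    cudaFree(d_cycles);
    return (err == cudaSuccess) ? cycles : -1;
}

int main()
{
    const unsigned int seed = 0x40302010u;
    unsigned int result = 0;
    long long cycles = measure_chain(seed, 0x01020304u, &result);
    if (cycles < 0) {
        fprintf(stderr, "CUDA error while running the measurement\n");
        return 1;
    }
    printf("result %s seed, %lld cycles total, %.2f cycles per __vsub4\n",
           result == seed ? "==" : "!=", cycles,
           static_cast<double>(cycles) / kChainLength);
    return 0;
}
```

Treat the result as an estimate: it includes the overhead of reading the clock, and the compiler is free to schedule code around the clock reads. Inspecting the kernel with cuobjdump --dump-sass shows whether the dependent chain really sits between the two clock reads and which instructions it consists of on your architecture.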