I am working on maximizing the performance of a kernel that does element-wise multiplication of a large matrix. According to https://en.wikipedia.org/wiki/FLOPS, most modern GPU architectures (including the one I'm using) can perform 2 floating-point operations per cycle at 32-bit precision. To take advantage of this, does NVCC automatically generate assembly that uses this capability, or does it need to be programmed explicitly in the CUDA code? For example, in a kernel that performs two separate multiplications one after the other, will the generated PTX execute both multiplications simultaneously, or is that something that must be specified in the CUDA code?
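To make the question concrete, here is a minimal sketch of the kind of kernel I mean (the names, array layout, and the use of two independent output arrays are just illustrative):

```cuda
// Illustrative element-wise kernel: each thread performs two
// independent single-precision multiplications back to back.
__global__ void elementwiseMul(const float *a, const float *b,
                               const float *c, const float *d,
                               float *out1, float *out2, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out1[i] = a[i] * b[i];  // first multiplication
        out2[i] = c[i] * d[i];  // second, independent multiplication
    }
}
```

I can inspect the intermediate code with `nvcc -ptx kernel.cu` and see two separate `mul.f32` instructions, but I am not sure whether that tells me anything about how they are scheduled on the hardware.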