Question about Interpretation of CROP unit throughput metric

While profiling my application, I found that using RGBA 32-bit floating point as the output image pixel format slows performance down by about 4x compared to 16-bit floating point. (I know it is a terrible idea to use 32-bit floating point for the output image in a typical graphics application, but that's another story.)
Strangely, every unit's throughput (and not only the unit throughputs, but also things like L2 bandwidth) is shown as heavily underutilized in the fp32 case, at least according to Nsight Graphics GPU profiling.

Now I suspect that the CROP contains separate fp16 and fp32 units, which leads to poor performance when using 32-bit blending (or at least that something like the CROP color compression unit is involved…). However, the CROP also shows poor unit utilization in Nsight Graphics.
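
To put a number on what the format change alone should cost, here is a rough back-of-envelope sketch of render-target traffic per pixel. It assumes uncompressed targets and that blending reads the destination once and writes it once. The function names are made up for illustration, and this says nothing about how the CROP actually processes each format.

```python
# Bytes per channel for the two output formats being compared.
BYTES_PER_CHANNEL = {"RGBA16F": 2, "RGBA32F": 4}


def bytes_per_pixel(fmt, channels=4):
    """Size of one pixel in bytes for a 4-channel floating-point format."""
    return BYTES_PER_CHANNEL[fmt] * channels


def blend_traffic_per_pixel(fmt, blending=True):
    """Assumed traffic per pixel: one write, plus one destination read if blending."""
    size = bytes_per_pixel(fmt)
    return size * 2 if blending else size


fp16_traffic = blend_traffic_per_pixel("RGBA16F")  # 16 bytes per blended pixel
fp32_traffic = blend_traffic_per_pixel("RGBA32F")  # 32 bytes per blended pixel
ratio = fp32_traffic / fp16_traffic
print(ratio)  # 2.0
```

Under these assumptions the pixel size alone accounts for a 2x difference in traffic, so the remaining factor in the roughly 4x slowdown would have to come from somewhere else.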

Now, the main question: how should I interpret the numbers given by unit throughput? Is Nsight Graphics giving a misleading number for unit throughput, with CROP fp32 operation being the actual bound? Or should I look for other reasons to explain this result?

Best regards,

Hi jasonkim2,

It’s hard to say based on your description. What’s your environment (Nsight Graphics version, GPU, driver, OS, etc.)? I am also not sure which Nsight Graphics activity you are using. If you are using the GPU Trace activity, maybe you can check the online documentation.