PyTorch/CUDA half precision vs full precision FLOPS

I am comparing a half-precision workload (tensors are torch.HalfTensor) against a full-precision workload (tensors are torch.FloatTensor), using nvprof to profile my PyTorch model.
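For concreteness, here is a stripped-down sketch of the kind of comparison I am running (the real model is larger; the layer sizes and batch shape here are just placeholders):

```python
import torch
import torch.nn as nn

# Placeholder stand-in for my actual model: a small linear stack.
model = nn.Sequential(
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, 1024),
).cuda()
x = torch.randn(64, 1024, device="cuda")

# Full-precision (float32) pass.
with torch.no_grad():
    out_fp32 = model(x)

# Half-precision pass: .half() converts the module's parameters to
# float16 in place, and the input is cast to float16 as well.
model.half()
with torch.no_grad():
    out_fp16 = model(x.half())

torch.cuda.synchronize()
```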

I noticed that even when running the model with half-precision tensors, nvprof reports only single-precision operations (rather than half-precision ones), albeit far fewer FLOPs than occurred with the full-precision model.
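(Assuming the relevant counters here are nvprof's flop_count_sp and flop_count_hp metrics, the invocation would look roughly like `nvprof --metrics flop_count_sp,flop_count_hp python run_model.py`, with run_model.py standing in for my actual script: flop_count_sp is non-zero for both runs, while the half-precision count stays at zero.)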

My suspicion is that there is some bit-packing optimization happening somewhere in the PyTorch/CUDA library bindings.

I'd like to confirm this, or, if that's not right, hear what the explanation is. Thanks!