Tegra X1 half precision data with PyTorch

I am using PyTorch on my Tegra X1 GPU and running the command-line tool nvprof to do some kernel profiling of DCNN layers.
As reported by nvprof, the half-precision (16-bit) workloads execute just as many floating-point operations as the full-precision (32-bit) runs (actually slightly more), but only about half as much data is fetched from off-chip memory. Kernel durations also generally seem to be longer in the 16-bit case.
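For context, this is roughly the kind of comparison I am running (layer and tensor sizes here are just illustrative, not my actual network); I profile each pass separately under nvprof:

```python
import torch
import torch.nn as nn

def run_conv(dtype):
    # One conv layer and input tensor in the given precision (sizes are made up)
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda().to(dtype)
    x = torch.randn(1, 64, 56, 56, device="cuda", dtype=dtype)
    with torch.no_grad():
        for _ in range(10):       # a few iterations so nvprof sees stable kernels
            y = conv(x)
    torch.cuda.synchronize()      # make sure all kernels finish before returning
    return y

run_conv(torch.float32)   # 32-bit pass
run_conv(torch.float16)   # 16-bit pass
```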
My conclusion is that there is some sort of bit packing occurring with the torch.half() data.
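The storage side at least seems easy to confirm: a half tensor really does use 2 bytes per element instead of 4, which would line up with the halved off-chip traffic I am seeing:

```python
import torch

x32 = torch.randn(1024, device="cuda")   # float32 tensor
x16 = x32.half()                         # same values stored as float16

print(x32.element_size(), x16.element_size())   # prints: 4 2 (bytes per element)
```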
Could someone help explain what is going on here?