Reduce number of docker layers in nvhpc runtime images

Great job with the new nvhpc images; they clean up a lot of things I've previously had to do by hand in images with CUDA-aware OpenMPI, etc.

The runtime images have a very large number of layers (~100), caused by many discrete copy steps: `docker history nvcr.io/nvidia/nvhpc:20.9-runtime-cuda11.0-ubuntu20.04` lists 100 layers. This makes them difficult to build on top of, since the maximum number of layers Docker supports with the overlay2 storage driver is ~125.
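To see how much headroom an image leaves for derived layers, one option is to count the filesystem layers that `docker image inspect` reports under `RootFS.Layers`. Note that `docker history` also lists metadata-only steps (such as `ENV`) as zero-size rows, so its row count can be higher than the number of filesystem layers. The following is a minimal Python sketch, assuming the ~125 figure above as the practical ceiling; the helper names (`inspect_image`, `count_fs_layers`, `headroom`) and the `ASSUMED_LAYER_LIMIT` constant are hypothetical, not part of any Docker or NVIDIA tooling.

```python
import json
import subprocess

# Hypothetical constant: the approximate overlay2 ceiling quoted in this thread.
ASSUMED_LAYER_LIMIT = 125


def inspect_image(image: str) -> str:
    """Return the raw JSON printed by `docker image inspect <image>`."""
    result = subprocess.run(
        ["docker", "image", "inspect", image],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def count_fs_layers(inspect_output: str) -> int:
    """Count filesystem layers in `docker image inspect` output.

    The command prints a JSON array with one object per image; only
    RootFS.Layers is counted, so metadata-only history entries are ignored.
    """
    images = json.loads(inspect_output)
    return len(images[0]["RootFS"]["Layers"])


def headroom(layer_count: int, limit: int = ASSUMED_LAYER_LIMIT) -> int:
    """Layers left for a derived image before reaching the assumed limit."""
    return limit - layer_count
```

For example, `headroom(count_fs_layers(inspect_image("nvcr.io/nvidia/nvhpc:20.9-runtime-cuda11.0-ubuntu20.04")))` would estimate how many layers a derived image can still add, assuming Docker is installed and the image has been pulled.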

It would be helpful if the runtime images were squashed into a single layer. I can squash them myself before adding my own layers, but that feels like an unnecessary step.

Thanks for the note. The 20.9 package contains some extraneous files in the REDIST directory, causing the CUDA 11.0 runtime image to have ~40 more layers than the CUDA 10.1 or 10.2 runtime images. We expect this to be fixed in the next release, which should help with this issue.

In the meantime, squashing the image is a fine workaround. However, because squashed images lose the benefit of layer caching, we don't plan to squash the images on NGC.
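One way to do the squash locally is to flatten a container's filesystem with `docker export` and load it back as a single-layer image with `docker import`. The following is a minimal Python sketch of that approach, not an NVIDIA-recommended procedure. It assumes the source image has a default command (so `docker create` succeeds) and that its environment variables contain no spaces. The function names, the `squash-tmp` container name, and the destination tag are hypothetical. `docker import` does not carry over image configuration, so environment variables are re-applied with `--change "ENV ..."`. Other settings such as `ENTRYPOINT`, `CMD`, or `WORKDIR` would also need `--change` entries, and the squashed image shares no layers with the original, which is the caching trade-off mentioned above.

```python
import subprocess


def squash_commands(src_image, dst_tag, env=(), container="squash-tmp"):
    """Build the docker commands for an export/import flatten of src_image.

    Each "KEY=VALUE" string in `env` is re-applied as an ENV instruction,
    because `docker import` starts from an empty image config.
    Values containing spaces would need extra quoting (not handled here).
    """
    create = ["docker", "create", "--name", container, src_image]
    export = ["docker", "export", container]
    import_cmd = ["docker", "import"]
    for kv in env:
        import_cmd += ["--change", f"ENV {kv}"]
    import_cmd += ["-", dst_tag]  # "-" makes docker import read the tarball from stdin
    remove = ["docker", "rm", container]
    return create, export, import_cmd, remove


def squash(src_image, dst_tag, env=(), container="squash-tmp"):
    """Run `docker export | docker import` to produce a single-layer image."""
    create, export, import_cmd, remove = squash_commands(src_image, dst_tag, env, container)
    subprocess.run(create, check=True, stdout=subprocess.DEVNULL)
    try:
        exp = subprocess.Popen(export, stdout=subprocess.PIPE)
        imp = subprocess.Popen(import_cmd, stdin=exp.stdout)
        exp.stdout.close()  # lets export receive SIGPIPE if import exits early
        imp_rc, exp_rc = imp.wait(), exp.wait()
        if exp_rc != 0:
            raise subprocess.CalledProcessError(exp_rc, export)
        if imp_rc != 0:
            raise subprocess.CalledProcessError(imp_rc, import_cmd)
    finally:
        # Always remove the temporary container, even if the pipeline failed.
        subprocess.run(remove, check=False, stdout=subprocess.DEVNULL)
```

The original image's environment can be read with `docker image inspect -f '{{json .Config.Env}}' <image>` and passed as `env`.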