Unsure why a float16 operation fails after using memory for the same float32 operation on ppc64le

My environment: POWER9 with Volta GPUs, CUDA 9.1, cuDNN 7.0.5.
I’m running a PyTorch test called test_Conv2d_groups_nobias on tensors of different sizes in float16 and float32. The test first runs in float32 and then repeats the same check in float16. The float32 run passes, but the float16 run fails with a precision failure.
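For context, here is roughly the sequence the test exercises. This is only a minimal sketch I wrote for illustration, not the actual test_Conv2d_groups_nobias code; the shapes, seed, tolerances, and the "cuda:0" device string are placeholders of my own choosing.

```python
import torch
import torch.nn.functional as F

def run_case(dtype, device="cuda:0"):
    # Grouped convolution with no bias; shapes and tolerances are
    # illustrative, not the ones used by test_Conv2d_groups_nobias.
    torch.manual_seed(0)
    x = torch.randn(2, 4, 8, 8)
    w = torch.randn(4, 2, 3, 3)  # groups=2, so in_channels/groups = 2
    out = F.conv2d(x.to(device, dtype), w.to(device, dtype), bias=None, groups=2)
    ref = F.conv2d(x.double(), w.double(), bias=None, groups=2)  # CPU float64 reference
    err = (out.double().cpu() - ref).abs().max().item()
    tol = {torch.float32: 1e-4, torch.float16: 1e-1}[dtype]  # placeholder tolerances
    print(f"{dtype}: max abs error {err:.3e}")
    assert err < tol, f"precision failure for {dtype}"

run_case(torch.float32)  # this order (float32 first) is the one where
run_case(torch.float16)  # the real test reports a float16 precision failure
```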

But if I reverse the order in which the dtypes are tested (i.e., run the float16 test first, then float32), the test passes.

Also, if I run the test with float32 on CUDA device 0 and then run the test with float16 on device 1, the test passes.
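A sketch of that passing variant, splitting the two dtypes across the two GPUs; again, the shapes and the "cuda:0"/"cuda:1" device strings are just illustrative assumptions about my two-GPU node.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 8, 8, dtype=torch.float64)
w = torch.randn(4, 2, 3, 3, dtype=torch.float64)
ref = F.conv2d(x, w, bias=None, groups=2)  # CPU float64 reference

# float32 on one GPU, float16 on the other: both stay within tolerance for me
for dtype, device in [(torch.float32, "cuda:0"), (torch.float16, "cuda:1")]:
    out = F.conv2d(x.to(device, dtype), w.to(device, dtype), bias=None, groups=2)
    err = (out.double().cpu() - ref).abs().max().item()
    print(device, dtype, "max abs error:", err)
```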

My conclusion is that leftover garbage in GPU memory is affecting the float16 test when it runs after the same test in float32.

The same failure does not occur on POWER8 with Pascal GPUs with CUDA 9.1 and cuDNN 7.0.4.

You may want to file a bug at developer.nvidia.com

Your best chance of success with the bug report is to give a very precise description of the machine setup, the test setup, and the exact sequence to reproduce your issue.

Thank you, I opened a bug report: https://developer.nvidia.com/nvidia_bug/2115934