Superlinear Scaling of HPCG

As part of a lab exercise, we ran the HPCG benchmark (version 3.1, built with CUDA 9) on nodes with two Tesla K20 GPUs each.
Everyone in the class got roughly the same result:
with one GPU we achieved about 19 GFlops, and with both GPUs about 50 GFlops, i.e. 25 GFlops per card, about 30% more per card than in the single-GPU run.
Because HPCG's memory requirement per GPU does not depend on the number of GPUs, we could not find an explanation for this superlinear performance increase.
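A short calculation, using the approximate class-wide figures quoted above, shows why the result counts as superlinear. With a fixed problem size per GPU, the ideal outcome for two GPUs is twice the single-GPU throughput, so a parallel efficiency above 100% is the anomaly. The variable names are illustrative only.

```python
# Approximate HPCG results from the lab runs, in GFlops
one_gpu = 19.0   # single Tesla K20
two_gpus = 50.0  # both K20s in the node
n_gpus = 2

speedup = two_gpus / one_gpu    # relative to the single-GPU run
per_gpu = two_gpus / n_gpus     # throughput per card
efficiency = speedup / n_gpus   # above 1.0 means superlinear scaling

print(f"speedup:    {speedup:.2f}x (ideal {n_gpus}x)")
print(f"per GPU:    {per_gpu:.1f} GFlops")
print(f"efficiency: {efficiency:.0%}")
```

This reports a speedup of about 2.63x against an ideal of 2x, which corresponds to roughly 132% parallel efficiency.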