I have code designed for multi-GPU use that calls cudaMemcpyAsync to transfer data asynchronously between devices. What happens if I run it on a single GPU? The number of GPUs is not known at compile time; it is read from a config file at run time.
Does the compiler account for this and insert a branch into the compiled code, so that once the number of devices is known at run time, cudaMemcpyAsync is not called when only one GPU is in use?
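
To make the scenario concrete, here is a minimal sketch of the kind of explicit run-time guard I would otherwise have to write by hand. It is plain C++ so that it runs without a GPU. The names CopyRequest, issue_async_copy, and exchange_between_devices are hypothetical, issue_async_copy is only a stand-in for the real cudaMemcpyAsync call, and the ring-style copy pattern is an assumption made for illustration, not my actual transfer layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record of one device-to-device transfer request.
struct CopyRequest {
    int src_device;
    int dst_device;
    std::size_t bytes;
};

// Hypothetical stand-in for cudaMemcpyAsync: it only logs the request,
// so the sketch can run on a machine without any GPU.
void issue_async_copy(std::vector<CopyRequest>& log,
                      int src_device, int dst_device, std::size_t bytes) {
    log.push_back({src_device, dst_device, bytes});
}

// num_devices is read from a config file at run time, so its value is
// unknown when this function is compiled. The single-GPU case is skipped
// only because the branch below is written out explicitly.
std::vector<CopyRequest> exchange_between_devices(int num_devices,
                                                  std::size_t bytes) {
    assert(num_devices >= 1);
    std::vector<CopyRequest> log;
    if (num_devices < 2) {
        return log;  // one GPU: no inter-device copy is issued
    }
    // Assumed pattern: each device sends to its next neighbour in a ring.
    for (int dev = 0; dev < num_devices; ++dev) {
        int next = (dev + 1) % num_devices;
        issue_async_copy(log, dev, next, bytes);
    }
    return log;
}
```

With num_devices equal to 1 the function returns an empty log. Without the guard, the loop in this sketch would issue a copy from device 0 to itself.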