Why is that strange? Depending on your system's specifications, which you have not disclosed, data exchanged between GPUs may need to be routed through the host.
Generally speaking, peer-to-peer communication between GPUs without host involvement is possible, but it requires specific hardware support and system topology. Whether your system meets those requirements is unknown, but the minimal information provided so far suggests that it does not.
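You can check for yourself whether your GPUs can talk to each other directly. A minimal sketch using the CUDA runtime API (requires at least two GPUs to report anything interesting):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    // Query every ordered pair of GPUs for peer-to-peer capability.
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU %d -> GPU %d: P2P %s\n", i, j,
                   canAccess ? "supported" : "not supported (traffic routed through host)");
        }
    }
    return 0;
}
```

If `cudaDeviceCanAccessPeer` reports 0 for a pair, any data exchange between those two GPUs goes through host memory.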
In my understanding, the main point of cublasXt is to work around GPU memory capacity limits. It can distribute large GEMMs across multiple GPUs plus the host system, utilizing essentially all of the memory in the entire system (the so-called "out of core" scenario).
Obviously this requires data to be parceled out and shipped around to available computational resources, and partial results to be shipped back and assembled into a final result. Where peer-to-peer communication is not available, this communication overhead can limit performance, but that does not detract from the intended main benefit of cublasXt, i.e. handling matrix operations that will not fit into a single GPU's memory.
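To illustrate the out-of-core usage: with cublasXt the operand matrices stay in ordinary host memory, and the library tiles them out to whichever GPUs you select. A hedged sketch (the device list `{0, 1}` is an assumption; adjust it to your system, and add error checking in real code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cublasXt.h>

int main() {
    const size_t m = 4096, n = 4096, k = 4096;
    // Host-resident operands: together they may exceed any single
    // GPU's memory; cublasXt streams tiles to the GPUs as needed.
    float *A = (float*)malloc(m * k * sizeof(float));
    float *B = (float*)malloc(k * n * sizeof(float));
    float *C = (float*)malloc(m * n * sizeof(float));
    for (size_t i = 0; i < m * k; ++i) A[i] = 1.0f;
    for (size_t i = 0; i < k * n; ++i) B[i] = 1.0f;
    for (size_t i = 0; i < m * n; ++i) C[i] = 0.0f;

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);
    int devices[] = {0, 1};                 // hypothetical: GPUs 0 and 1
    cublasXtDeviceSelect(handle, 2, devices);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C, computed across the selected GPUs.
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                  &alpha, A, m, B, k, &beta, C, m);

    printf("C[0] = %f\n", C[0]);            // expect k = 4096 for all-ones inputs
    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```

Note there are no `cudaMalloc` or `cudaMemcpy` calls: the tiling and host-to-device transfers are exactly the communication overhead discussed above.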
I am not a cublasXt expert. If you can find language in NVIDIA’s documentation that states that the goal of cublasXt is to accelerate GEMMs by executing them across multiple GPUs, by all means point it out here.