I am looking for advice on how to profile multi-GPU code and what expectations I should have. In particular, I am experimenting with a dual-GPU GeForce GTX Titan Z, comparing a simple kernel that adds a constant to a vector, run on either one GPU or both.
I am finding that my results are all over the place and highly dependent on the size of the vector. With too small a vector I actually see a performance degradation on multi-GPU (which I assume is due to the extra overhead of coordinating two GPUs). But should I expect gradually better performance as I increase the size of the vector, or is that a poor assumption?
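For what it's worth, my rough mental model is a fixed per-launch overhead plus work that scales with the vector size, which would predict exactly this crossover behavior (slowdown for small vectors, speedup approaching 2x for large ones). All the numbers below are illustrative, not measurements:

```python
# Toy model: dual-GPU time = half the per-element work + a fixed
# overhead (extra launches, synchronization, host-side coordination).
# per_elem and overhead are made-up illustrative constants.

def single_gpu_time(n, per_elem=1e-9):
    return n * per_elem

def dual_gpu_time(n, per_elem=1e-9, overhead=1e-4):
    # Each GPU handles half the elements, plus a fixed cost.
    return (n / 2) * per_elem + overhead

for n in (10**4, 10**6, 10**8):
    speedup = single_gpu_time(n) / dual_gpu_time(n)
    print(f"n = {n:>9}: predicted speedup ~ {speedup:.2f}x")
```

Under this model, small vectors are dominated by the fixed overhead (speedup below 1x, i.e. a slowdown), and the speedup climbs smoothly toward 2x as n grows. Is that the right way to think about it, or are there other effects (PCIe transfers, clock variability, etc.) that would make the curve less monotonic?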
Any feedback would help.