Profiling multi-GPU code

I am looking for some advice on how to profile multi-GPU code and what expectations I should have. In particular, I am experimenting with a dual-GPU GeForce GTX Titan Z, and I am comparing a simple kernel that adds a constant to a vector, which can run on either one or both GPUs.

I am finding that the results are all over the place and highly dependent on the size of the vector. If I use too small a vector, I actually see a performance degradation on multi-GPU (which I assume is due to the additional overhead of managing multiple GPUs). But should I expect gradually better performance as I increase the size of the vector, or is that a poor assumption?

Any feedback would help.

Adding a constant to a vector is not a good use of a GPU. The performance will be entirely memory bandwidth bound, which means that above a certain small vector size (say, a few thousand elements) there will be no scaling benefit. Longer vectors will take a linearly longer amount of time to update, regardless of whether you are using one or two GPUs.
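To make the bandwidth argument concrete, here is a rough single-GPU model. The bandwidth figure is an assumption (a ballpark for one GPU of a GTX Titan Z); substitute your own measured number:

```python
# Back-of-envelope model of a memory-bandwidth-bound vector add on one GPU.
# BANDWIDTH is an assumed device-memory figure, not a measured one.

BANDWIDTH = 336e9  # bytes/s (assumed ballpark for one Titan Z GPU)

def kernel_time(n, bytes_per_elem=4):
    # Adding a constant reads each element once and writes it once,
    # so the kernel moves 2 * n * bytes_per_elem bytes in total.
    return 2 * n * bytes_per_elem / BANDWIDTH

# Doubling the vector doubles the estimated time: purely linear scaling.
print(kernel_time(10_000_000))  # about 2.4e-4 seconds
print(kernel_time(20_000_000))  # about 4.8e-4 seconds
```

Once the vector is big enough to saturate memory bandwidth, there is no further parallel speedup to be had from the arithmetic; the time is just bytes moved divided by bandwidth.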

Furthermore, if this is the only work you are doing on the GPU, the cost of transferring the data between host and device will swamp any benefit from doing the work on the GPU.
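A quick estimate shows why the transfer dominates. Both bandwidth figures below are assumptions (roughly PCIe 3.0 x16 in practice, and the same assumed device-memory number as above):

```python
# Rough comparison of host<->device transfer time vs. on-device kernel
# time for a vector add. Both bandwidth numbers are assumptions.

PCIE_BW = 12e9   # bytes/s over PCIe (assumed, ~PCIe 3.0 x16 in practice)
MEM_BW = 336e9   # bytes/s device memory (assumed)

def times(n, bytes_per_elem=4):
    size = n * bytes_per_elem
    transfer = 2 * size / PCIE_BW  # copy the input down, copy the result back
    kernel = 2 * size / MEM_BW     # one read + one write per element
    return transfer, kernel

t, k = times(100_000_000)
print(f"transfer {t*1e3:.1f} ms vs kernel {k*1e3:.2f} ms ({t/k:.0f}x slower)")
```

With these numbers the round-trip copy is roughly 28x slower than the kernel itself, so the end-to-end time is essentially all PCIe transfer no matter how many GPUs share the work.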

You might try an operation whose work grows faster than its data size, such as matrix multiplication.
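The reason matrix multiply behaves differently can be seen from its arithmetic intensity (flops per byte of memory traffic). This sketch counts only the compulsory traffic for 4-byte floats, ignoring caching details:

```python
# Arithmetic intensity (flops per byte moved) for the two operations,
# assuming 4-byte floats and counting only compulsory memory traffic.

def vector_add_intensity(n):
    flops = n             # one add per element
    traffic = 2 * n * 4   # read + write each element
    return flops / traffic  # constant 1/8 flop/byte, independent of n

def matmul_intensity(n):
    flops = 2 * n**3        # each output element needs n multiply-adds
    traffic = 3 * n**2 * 4  # read A and B, write C (idealized)
    return flops / traffic  # grows like n/6: more compute per byte

print(vector_add_intensity(4096))  # 0.125
print(matmul_intensity(4096))      # about 683
```

Because matrix multiply does O(n^3) work on O(n^2) data, its compute-per-byte grows with problem size, so larger problems keep the GPUs busy with arithmetic rather than waiting on memory, and splitting the work across two GPUs has a chance of paying off.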