cublasXT performance

I have a system with 2 K20x Kepler cards. I tried to replicate the 2x scaling using 2 cards + cublasXT for sgemm, dgemm and zgemm. I was not able to replicate the 2x for sgemm or dgemm; the score with 1 or 2 cards is about the same. I was able to get about 2x scaling only with zgemm.
Any suggestions?


Got it to work :)

It might be interesting for other forum participants to read what the initial stumbling block was, how you resolved it, and what kind of scaled performance you were able to achieve for your use case.
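
For anyone posting numbers, here is a minimal sketch of one way to turn measured GEMM timings into a GFLOP/s figure and a scaling factor so that 1-card and 2-card runs are comparable. It assumes the "score" is a throughput in GFLOP/s and uses the usual nominal operation counts (2*m*n*k for real GEMM, 8*m*n*k for complex GEMM). The function names are hypothetical helpers and not part of cublasXT, and the timings in the usage section are placeholders, not measurements.

```python
def gemm_flops(m, n, k, complex_valued=False):
    """Nominal floating-point operation count for an (m x k) * (k x n) GEMM."""
    # Real GEMM (sgemm/dgemm): one multiply and one add per inner-product term.
    # Complex GEMM (cgemm/zgemm): each complex multiply-add is conventionally
    # counted as 8 real operations.
    per_term = 8 if complex_valued else 2
    return per_term * m * n * k


def gflops(m, n, k, seconds, complex_valued=False):
    """Throughput in GFLOP/s for one GEMM call that took `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gemm_flops(m, n, k, complex_valued) / seconds / 1e9


def scaling(t_one_gpu, t_two_gpus):
    """Speedup of the 2-GPU run over the 1-GPU run (2.0 would be ideal)."""
    if t_one_gpu <= 0 or t_two_gpus <= 0:
        raise ValueError("timings must be positive")
    return t_one_gpu / t_two_gpus


if __name__ == "__main__":
    # Placeholder values only; substitute your own problem size and timings.
    m = n = k = 8192
    t1, t2 = 1.00, 0.55  # seconds, hypothetical
    print("1 GPU :", gflops(m, n, k, t1), "GFLOP/s")
    print("2 GPUs:", gflops(m, n, k, t2), "GFLOP/s")
    print("scaling:", scaling(t1, t2))
```

Reporting the matrix dimensions alongside the timings makes it much easier for others to compare results, since the same wall-clock time means very different throughput for sgemm and zgemm.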