cudaMemcpyAsync returns ‘invalid device ordinal’ on device-to-device copy

cudaMemcpyAsync() returns the error ‘invalid device ordinal’ when I copy two buffers from one device to another. As a debug check, I copy the same device buffers to the host and call cudaGetLastError() just before the cudaMemcpyAsync() call. Both device-to-host copies succeed, and no error is reported before the device-to-device copy.

The device-to-device copy fails no matter how small the buffer size.

Additional info:
– Copying from a K20 to a GTX Titan Black
– Running RHEL v6.8 x64
– CUDA SDK v7.5
– Driver version 352.99

Any ideas, please?