I spent some time working this one out. My main thread called cublasInit() and did all the memory allocation on the GPU. A second thread did the actual copying to GPU memory and the CUBLAS matrix multiplies.
Kaboom! You can’t copy to GPU memory that was allocated in a different thread.
Doing the lot (init, allocation, copies and multiplies) in one thread runs just fine. Hope this helps someone, as the error messages don’t point to threading as the cause.
PS Love the speedups CUDA brings to matrix multiplies!