When you launch kernels, you have to be in the right host thread. But what about memory copy operations such as cudaMemcpy2D and cudaMemcpy2DToArray? Do they also have to be called from the same thread that created the CUDA resources?
Yes. The same applies to pinned (page-locked) host memory allocated with cudaMallocHost(): you have to call cudaMallocHost() and the cudaMemcpy*() functions from the same host thread.
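One way to guarantee that is to send every CUDA-related call through a single dedicated host thread. Here is a minimal sketch in plain C++, assuming the same-thread rule above applies to your setup. GpuThread and allocation_and_copy_share_thread are hypothetical names made up for this illustration, not part of CUDA, and the real CUDA calls appear only as comments so the sketch builds without the toolkit.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper: runs every submitted job on one dedicated host thread.
class GpuThread {
public:
    GpuThread() : worker_([this] { run(); }) {}

    ~GpuThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Queue a job and block until the dedicated thread has finished it.
    void submit(std::function<void()> job) {
        std::packaged_task<void()> task(std::move(job));
        std::future<void> finished = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(task));
        }
        cv_.notify_one();
        finished.get();  // also rethrows anything the job threw
    }

private:
    void run() {
        for (;;) {
            std::packaged_task<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // shutdown requested, queue drained
                task = std::move(jobs_.front());
                jobs_.pop();
            }
            task();  // run the job outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<void()>> jobs_;
    bool done_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};

// Usage sketch: the allocation and the copy are submitted separately
// but both end up on the same dedicated thread.
bool allocation_and_copy_share_thread() {
    GpuThread gpu;
    std::thread::id alloc_thread, copy_thread;
    gpu.submit([&] {
        // cudaMallocHost(...) would go here
        alloc_thread = std::this_thread::get_id();
    });
    gpu.submit([&] {
        // cudaMemcpy2D(...) or cudaMemcpy2DToArray(...) would go here
        copy_thread = std::this_thread::get_id();
    });
    return alloc_thread == copy_thread &&
           alloc_thread != std::this_thread::get_id();
}
```

Because submit() blocks until the job has run, the calling thread can safely use the results afterwards, while the allocation, the copies and the kernel launches all originate from the one worker thread.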