CUDA called from multiple threads

In short, is there any way of calling CUDA runtime API functions or kernels from multiple threads? I have tried using a mutex covering all CUDA related code in a thread, but it still crashes, and I can’t determine the nature of the error as the program exits before cudaGetLastError is called. Multiple threads could allow me to utilise all available cores for non-CUDA activities.
If it is possible, is there anything special I have to do, or am I just using boost thread and mutex incorrectly?

Many thanks in advance

The issue is not thread safety so much as the fact that, in the runtime API, a CUDA context is tied to the host thread that created it (it is created implicitly by the first CUDA call in that thread). Mutex protection is therefore not sufficient. Instead, you need to guarantee that all CUDA calls are made by the same host thread for the life of your program. It doesn't matter which thread, it just has to be the same one.

Multi-GPU programming follows the same rule per device: spawn one host thread per GPU and call cudaSetDevice with a different device index in each thread before making any other CUDA calls there.