Thread safety in CUDA 4.0: single device context creates problems

Hi,

Let’s say my program runs 2 parallel threads that both use CUDA and NPP functions. It worked fine in CUDA 3.2, as each thread got its own device context and they didn’t interfere with each other.

Now I have switched to CUDA 4.0 and the threads have started disturbing each other. I don’t know exactly what is going on, but erroneous data is sometimes copied back to the host, and I’m pretty sure it’s a thread safety problem, as everything works fine with only one of the threads running.

My question is: what is the preferred way to handle this? Should I use streams for the memory copies and kernels? But what about the NPP calls then? To clarify, it is not important for me to run different CUDA calls concurrently. I just want to prevent calls from different threads interfering with each other.
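For what it’s worth, the kind of serialization I have in mind would look roughly like the sketch below: each thread takes one shared lock around its whole sequence of GPU calls. This is only a minimal sketch under assumptions. `g_gpu_mutex`, `process_image` and `run_two_workers` are hypothetical names of mine, not part of CUDA or NPP, and the loop is a stand-in for the real copy/kernel/NPP sequence so that it runs without a GPU. I don’t know whether this addresses the actual cause of the problem.

```cpp
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical process-wide lock guarding all GPU work (not a CUDA/NPP API).
static std::mutex g_gpu_mutex;

// Stand-in for one thread's sequence of GPU calls. In the real program the
// body would be: copy host->device, kernel launch / NPP call, copy
// device->host. Here it only fills a host buffer.
void process_image(std::vector<int>& image, int value) {
    // Hold the lock for the whole sequence, not per call, so that the
    // other thread's calls cannot be interleaved with this one's.
    std::lock_guard<std::mutex> lock(g_gpu_mutex);
    for (auto& px : image) {
        px = value;
    }
}

// Runs two workers on separate buffers and checks that each buffer holds
// only the value written by its own worker.
bool run_two_workers() {
    std::vector<int> a(1024, 0);
    std::vector<int> b(1024, 0);

    std::thread t1(process_image, std::ref(a), 1);
    std::thread t2(process_image, std::ref(b), 2);
    t1.join();
    t2.join();

    for (int px : a) {
        if (px != 1) return false;
    }
    for (int px : b) {
        if (px != 2) return false;
    }
    return true;
}
```

Calling `run_two_workers()` from `main` exercises both threads. The obvious cost is that the two threads never overlap on the GPU, which is acceptable for me since I don’t need concurrency.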

Thanks

It would already greatly help me if someone could shed some light on what might be going wrong.
The two threads I’m using don’t operate on shared variables, so theoretically they shouldn’t get in each other’s way. Both are started from inside a class and use their own data and variables. There are no Async calls involved.
Yet I occasionally get errors in the results, like a processed image being half black.

Can someone explain what might cause this?