So I found the problem that I described in an earlier thread. Apparently there is a bug in the function “nppiResize_8u_C1R” (and possibly others).
The problem is that in CUDA 4.0 two nppiResize_8u_C1R calls cannot be executed concurrently. They operate on different data and share no resources, yet they cause data errors when they run at the same time. I was able to work around it by wrapping all the resize calls in a mutex to force serial execution. But I assume this is a bug, and it should be possible to run them concurrently when no data is shared.
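For anyone hitting the same thing, here is a rough sketch of the mutex workaround. It is not my actual code: it assumes a C++11 compiler for std::mutex and std::thread (pthreads or Boost would do the same job), the names serialized_resize, g_resize_mutex and max_overlap_with_two_threads are made up for illustration, and fake_resize is only a stand-in for the real nppiResize_8u_C1R call so the sketch builds without NPP.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// One process-wide lock shared by every resize call (hypothetical name).
static std::mutex g_resize_mutex;

// Runs any callable while holding the resize lock, so two calls never overlap.
template <typename Fn>
auto serialized_resize(Fn&& fn) -> decltype(fn()) {
    std::lock_guard<std::mutex> lock(g_resize_mutex);
    return fn();
}

// Bookkeeping used only to observe how many calls are active at once.
static std::atomic<int> g_active(0);
static std::atomic<int> g_max_active(0);

// Stand-in for the real resize call (assumption: the real code would call
// nppiResize_8u_C1R here). It records the highest number of overlapping calls.
int fake_resize() {
    int now = ++g_active;
    int prev = g_max_active.load();
    while (now > prev && !g_max_active.compare_exchange_weak(prev, now)) {
        // prev is refreshed by compare_exchange_weak on failure; retry.
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    --g_active;
    return 0;
}

// Starts two threads that both resize through the lock and returns the
// highest number of calls that were ever active at the same time.
int max_overlap_with_two_threads() {
    g_active = 0;
    g_max_active = 0;
    std::thread a([] { serialized_resize(fake_resize); });
    std::thread b([] { serialized_resize(fake_resize); });
    a.join();
    b.join();
    assert(g_active.load() == 0);  // both calls have finished
    return g_max_active.load();
}
```

In real code the lambda passed to serialized_resize would contain the actual NPP call. The obvious cost is that the resizes no longer overlap at all, which is why I consider this a workaround and not a fix.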
No other functions, whether NPP or my own kernels, seem to have this problem in my code. It looks like something specific to these resize functions that now shows up with the context being shared between threads in 4.0.
As stated, I use the latest CUDA 4.0 on a 32-bit machine. Hopefully someone can take a look.