So I tracked down the problem I described in an earlier thread. Apparently there is a bug in the function nppiResize_8u_C1R (and possibly others).
The problem is that under CUDA 4.0 one cannot execute two nppiResize_8u_C1R calls concurrently. They operate on different data and share no resources, yet they produce data errors when running at the same time. I could resolve it by encapsulating all the resize calls in a mutex to ensure serial execution, but I suspect there is simply a bug and it should be possible to run them concurrently when no data is shared.
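For reference, my workaround looks roughly like this. This is a minimal sketch, assuming the CUDA 4.0-era signature of nppiResize_8u_C1R; the std::mutex usage and the wrapper name are just for illustration:

```cpp
// Sketch of the workaround: serialize all nppiResize_8u_C1R calls
// through one process-wide mutex. Names are illustrative only.
#include <mutex>
#include <npp.h>

static std::mutex g_resizeMutex;  // guards every resize call

NppStatus resizeSerialized(const Npp8u* pSrc, NppiSize oSrcSize, int nSrcStep,
                           NppiRect oSrcROI, Npp8u* pDst, int nDstStep,
                           NppiSize oDstSize, double xFactor, double yFactor)
{
    // Only one thread at a time may enter the NPP resize primitive.
    std::lock_guard<std::mutex> lock(g_resizeMutex);
    return nppiResize_8u_C1R(pSrc, oSrcSize, nSrcStep, oSrcROI,
                             pDst, nDstStep, oDstSize,
                             xFactor, yFactor, NPPI_INTER_LINEAR);
}
```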
No other functions, whether NPP or my own kernels, seem to have this problem in my code. It looks like something specific to these resize functions that only surfaces now that 4.0 shares one context across threads.
As stated, I use the latest CUDA 4.0 on a 32-bit machine. Hopefully someone can take a look.
No, I'm running two threads that both call nppiResize on separate input and output data. The way I understand it, nppSetStream() would not help much here, because the other thread could change the stream at any time.
So they might not execute concurrently on the GPU, and I don't actually care about that for now. What I do care about is that they cause errors when called from parallel threads. This wasn't the case in CUDA 3.2, because each thread had its own device context, but it shows up now.
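To make the setup concrete, it boils down to something like the sketch below. Buffer names, sizes, and the std::thread usage are made up; the point is that each thread owns completely independent device buffers:

```cpp
// Rough sketch of the failing setup: two host threads, fully independent
// device buffers, each calling nppiResize_8u_C1R without any locking.
#include <thread>
#include <npp.h>

void resizeWorker(const Npp8u* d_src, NppiSize srcSize, int srcStep,
                  Npp8u* d_dst, int dstStep, NppiSize dstSize)
{
    NppiRect srcROI = { 0, 0, srcSize.width, srcSize.height };
    double xf = (double)dstSize.width  / srcSize.width;
    double yf = (double)dstSize.height / srcSize.height;
    // Under CUDA 4.0, two of these calls running at the same time
    // produce corrupted output even though no data is shared.
    nppiResize_8u_C1R(d_src, srcSize, srcStep, srcROI,
                      d_dst, dstStep, dstSize, xf, yf, NPPI_INTER_LINEAR);
}

// std::thread t1(resizeWorker, d_srcA, sizeA, stepA, d_dstA, dstStepA, dstSizeA);
// std::thread t2(resizeWorker, d_srcB, sizeB, stepB, d_dstB, dstStepB, dstSizeB);
```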
As I said, when I serialize the calls from my parallel threads with a mutex, everything works fine; but since the resize function seems to be the only one (so far) showing the problem, I suspect it's a bug.
I think it's a bug, as you say. I looked at the primitive in question, and I have a theory about what causes this.
If my theory about the cause turns out to be right, then there should be many more primitives that can cause problems. But it's not deterministic, i.e., it depends on the progress the various threads make through the host code of the primitive.
I'll file an internal bug for this; you should expect a fix in the 4.1 release. Until then, putting pretty much any NPP call into a critical section is probably the proper workaround. In fact, our fix for 4.1 will likely do exactly that internally for the affected primitives.
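As a sketch of what such a critical section could look like on the caller side, something like the generic wrapper below would cover all NPP primitives at once. This is illustrative only, not the actual 4.1 fix:

```cpp
// Hedged sketch: funnel every NPP primitive through one process-wide
// critical section. All NPP primitives return NppStatus, so a single
// wrapper template suffices.
#include <mutex>
#include <npp.h>

static std::mutex g_nppMutex;

template <typename Fn, typename... Args>
NppStatus nppGuarded(Fn fn, Args... args)
{
    std::lock_guard<std::mutex> lock(g_nppMutex);  // serialize all NPP entry
    return fn(args...);
}

// Usage:
// NppStatus status = nppGuarded(nppiResize_8u_C1R, pSrc, srcSize, srcStep,
//                               srcROI, pDst, dstStep, dstSize,
//                               xf, yf, NPPI_INTER_LINEAR);
```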