At the moment, streams are only for controlling asynchronous operations within a single GPU context. If you are using two GPUs, you currently need two host threads, because each GPU context must be owned by its own thread.
It is probably a context, thread, and GPU affinity issue; those are very hard to manage correctly with CUDA and OpenMP as things stand today. You might want to consider using something different for threading (say, Boost or a native thread library).
EDIT: Of course, there is also the possibility that both contexts are winding up on the same GPU. How are you assigning GPUs in the code?
I started over with OpenMP because that is what the project requires, and it works up to this point.
The assignment is exactly the same, based on omp_get_thread_num, and I set two OpenMP threads, as many as there are GPUs.
I really haven’t managed to find the problem.
But I can say that things are very fragile: I had to compile and run after every line of code, because even a small mistake could produce very weird results.