Different kernel launches, be they from one or multiple host threads, are executed one at a time on the device. While intermingling different kernel launches might seem like a good idea at first, a number of memory and synchronization issues crop up, bringing efficiency down.
In most cases, rethinking the parallelization approach helps. Perhaps you can have a third host thread, which will be the only host thread communicating with the CUDA device. The other two host threads would then fill out the data structures and signal the third thread to launch CUDA kernels and memcopies.
Can you describe your application in more detail? If you don’t want to disclose details publicly, you can send me a message.