On a GTX 470, if I create two contexts from two CPU threads and execute CUDA kernels in each of these contexts, with very little CPU-side processing, I get up to a 20% performance increase.
Why does this happen? Can the GPU execute kernels from different contexts simultaneously? Or maybe a kernel from one context is executed while, for example, the other CPU thread is waiting for data to arrive from GPU memory?
From CUDA Programming Guide 3.1 Chapter 18.104.22.168, p. 38:
Concurrent kernel execution works only on some devices of compute capability (CC) 2.0, and yours is one of them. So maybe yes: if a single kernel does not keep the GPU fully busy, this could be the reason for the performance increase. To run memory copies and kernels concurrently (supported on CC 1.1 and above), you would need to use different streams: one executing a kernel and the other doing the memcpy.
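To illustrate the stream approach, here is a minimal sketch, not a tuned implementation. The kernel `scale`, the helper `run_overlapped`, and the choice of two streams are hypothetical names and values made up for this example. It assumes `n` is divisible by the number of streams and uses the modern runtime call `cudaDeviceSynchronize` (older toolkits such as 3.1 would use `cudaThreadSynchronize`). Each stream copies its own chunk to the device, runs the kernel on it, and copies it back, so a copy in one stream may overlap with the kernel in the other if the device supports it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns the number of wrong elements (0 on success), or -1 on a CUDA error.
// Assumption: n is divisible by kStreams.
int run_overlapped(int n)
{
    const int kStreams = 2;
    const int chunk = n / kStreams;
    const size_t chunkBytes = chunk * sizeof(float);

    float *host = NULL;
    float *dev = NULL;

    // Page-locked (pinned) host memory is needed for async copies to overlap.
    if (cudaMallocHost((void **)&host, n * sizeof(float)) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&dev, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(host);
        return -1;
    }

    for (int i = 0; i < n; ++i) host[i] = (float)i;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    // Queue copy-in, kernel, and copy-out per stream. Work within one stream
    // runs in order; work in different streams is allowed to overlap.
    for (int s = 0; s < kStreams; ++s) {
        float *h = host + s * chunk;
        float *d = dev + s * chunk;
        cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d, chunk);
        cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams to finish before reading the results on the host.
    cudaError_t err = cudaDeviceSynchronize();

    int bad = 0;
    if (err == cudaSuccess) {
        for (int i = 0; i < n; ++i)
            if (host[i] != 2.0f * (float)i) ++bad;
    }

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dev);
    cudaFreeHost(host);

    return (err == cudaSuccess) ? bad : -1;
}
```

Calling `run_overlapped(1 << 20)` from `main` and checking for a return value of 0 confirms that both chunks were processed. Whether the copies and kernels actually overlap depends on the device, and a profiler timeline is the reliable way to check.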