I am writing a multithreaded application using the CUDA driver API, with a pool of threads performing CUDA tasks. Each thread has its own context, which does not change, so I do not use cuCtxPopCurrent.
I have data in page-locked host memory, and each thread performs the following sequence of operations:
cuMemcpyHtoD; kernel launch; cuMemcpyDtoH;
The page-locked host memory and the device memory have been allocated within the thread's own context. To measure pure throughput, the kernel is empty.
Now I want to process a set of data items of, say, 100 KB each. When I use one thread to process one item after another, performance is fine. When I use two threads, selected round-robin, throughput drops by roughly 50%. It does not decrease further if I use more threads.
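To make the setup concrete, here is a sketch of what each worker thread does (hypothetical names, error checking stripped; assumes CUDA 4.0+ where cuCtxSetCurrent is available to bind the thread's context once):

```c
#include <cuda.h>

#define ITEM_SIZE (100 * 1024)  /* one 100 KB data item */

/* Each worker binds its own context once and keeps it current
   for its whole lifetime -- no push/pop per operation. */
void worker(CUcontext ctx, CUfunction emptyKernel, int nItems)
{
    cuCtxSetCurrent(ctx);

    void *hostBuf;        /* page-locked host memory */
    CUdeviceptr devBuf;   /* device memory, allocated in this context */
    cuMemAllocHost(&hostBuf, ITEM_SIZE);
    cuMemAlloc(&devBuf, ITEM_SIZE);

    for (int i = 0; i < nItems; ++i) {
        cuMemcpyHtoD(devBuf, hostBuf, ITEM_SIZE);
        /* empty kernel, launched only so the sequence matches the benchmark */
        void *args[] = { &devBuf };
        cuLaunchKernel(emptyKernel, 1, 1, 1, 1, 1, 1, 0, NULL, args, NULL);
        cuMemcpyDtoH(hostBuf, devBuf, ITEM_SIZE);
    }

    cuMemFree(devBuf);
    cuMemFreeHost(hostBuf);
}
```

Since the copies and the launch are synchronous here, two threads with two different contexts force the device to switch contexts between every pair of operations, which is where I suspect the slowdown comes from.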
Does anyone know how much overhead a context switch causes on the device? I thought the push/pop operations were costly on the CPU side, but not on the GPU side.
Thanks in advance,