Contexts: performance overhead of context switching


I am writing a multithreaded application using the driver API, where I have a pool of threads performing CUDA tasks. Each thread has its own context, which never changes, so I do not use cuCtxPopCurrent.

I have data in page-locked host memory, and each thread performs the following sequence:
cuMemcpyHtoD; kernel launch; cuMemcpyDtoH;
Page-locked host memory and device memory are allocated within the thread's own context. To measure pure transfer throughput, the kernel is empty.
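A minimal sketch of that per-thread sequence (the function name `processItem`, the grid/block sizes, and the `CHECK` macro are illustrative, not from my actual code; `cuLaunchKernel` assumes CUDA 4.0 or later):

```c
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call) do { CUresult r = (call); \
    if (r != CUDA_SUCCESS) { fprintf(stderr, "CUDA error %d at %d\n", r, __LINE__); exit(1); } } while (0)

/* Hypothetical per-thread worker. The context was created by this
 * thread at startup and stays current for the thread's lifetime. */
static void processItem(CUfunction kernel, void *hostBuf, size_t bytes)
{
    CUdeviceptr devBuf;
    CHECK(cuMemAlloc(&devBuf, bytes));

    CHECK(cuMemcpyHtoD(devBuf, hostBuf, bytes));    /* host -> device */

    /* Launch the (empty) kernel; launch configuration is a placeholder. */
    void *args[] = { &devBuf };
    CHECK(cuLaunchKernel(kernel, 1, 1, 1, 256, 1, 1, 0, NULL, args, NULL));

    CHECK(cuMemcpyDtoH(hostBuf, devBuf, bytes));    /* device -> host */
    CHECK(cuMemFree(devBuf));
}
```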

Now I want to process a set of data items of, let's say, 100 KB each. When I use one thread to process one item after another, performance is fine. When I use 2 threads, selected in round-robin fashion, performance drops by roughly 50%. It does not decrease further if I use more threads.

Does anyone know how much overhead a context switch causes on the device? I thought the push/pop operations are costly on the CPU side, but not on the GPU side.

Thanks in advance,

Context switching causes significant overhead on the GPU, as you've seen. cuCtxPushCurrent/cuCtxPopCurrent are really the best way to accomplish what you're trying to do.
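That is, keep a single context for the GPU and migrate it between your worker threads. A sketch of that pattern (the mutex and the `runTask` wrapper are my own illustration of the constraint that a context may be current in only one thread at a time):

```c
#include <cuda.h>
#include <pthread.h>

/* One context for the whole GPU, shared by the thread pool.
 * ctxLock serializes access so the context is never current in
 * two threads at once. */
static CUcontext sharedCtx;
static pthread_mutex_t ctxLock = PTHREAD_MUTEX_INITIALIZER;

static void runTask(void (*task)(void))
{
    pthread_mutex_lock(&ctxLock);
    cuCtxPushCurrent(sharedCtx);   /* make the context current in this thread */
    task();                        /* HtoD copy, kernel launch, DtoH copy */
    cuCtxPopCurrent(&sharedCtx);   /* release it for the next thread */
    pthread_mutex_unlock(&ctxLock);
}
```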

Thanks for your quick reply.

But then I have another question: if I use multiple GPUs, with one context and one host thread per GPU, does switching between these contexts for each data item also cause overhead? Or does the overhead apply only to multiple contexts on a single GPU?

It applies to a single GPU only.
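So one context per device, each owned by a dedicated host thread, incurs no cross-device switching cost. A sketch of that setup (the `bindDevice` helper is illustrative; error handling omitted for brevity):

```c
#include <cuda.h>

/* Each host thread binds one GPU: create a context on its own device
 * and leave it current for the thread's whole lifetime. */
static CUcontext bindDevice(int ordinal)
{
    CUdevice dev;
    CUcontext ctx;
    cuInit(0);                  /* safe to call from every thread */
    cuDeviceGet(&dev, ordinal); /* ordinal: 0, 1, ... per GPU */
    cuCtxCreate(&ctx, 0, dev);  /* the new context becomes current here */
    return ctx;
}
```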