Concurrent kernel execution, blocking device to host transfers, and mapped memory

I’m working on an application where I would like to reduce the latency between kernel calls while keeping the GPU busy enough to maintain good throughput. I am using a Fermi GTX 480 and was hoping to use multiple streams to accomplish this. After a little research, however, I am left wondering whether what I want to do is possible. My plan was to launch a large number of small kernels in separate streams, which would reduce latency while keeping the device busy. My application is very compute bound, so I am not especially concerned with overlapping memory transfers and computation.

I listened to the ‘Streams and Concurrency’ webinar, and it would seem that I have to choose between the following:

  1. Issue memcopies / kernels breadth first across streams: kernels can execute concurrently on the same device, but device-to-host memory transfers are blocked until all the kernels in the compute queue have completed
  2. Issue memcopies / kernels depth first across streams: kernels cannot execute concurrently, but device-to-host memory transfers are not blocked by other streams’ kernels
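
For reference, the two issue orders can be sketched roughly as follows. This is only an illustration of the ordering, not a benchmark: `smallKernel`, the stream count, and the buffer sizes are placeholders I made up, and whether the kernels actually overlap depends on the device and on how many resources each kernel uses.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one small, compute-bound piece of work.
__global__ void smallKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int kStreams = 4;          // assumed stream count
    const int n = 1 << 16;           // assumed elements per stream
    const size_t bytes = n * sizeof(float);
    const int threads = 256, blocks = (n + threads - 1) / threads;

    cudaStream_t streams[kStreams];
    float *hostBuf[kStreams], *devBuf[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMallocHost((void **)&hostBuf[s], bytes);  // pinned, needed for async copies
        cudaMalloc((void **)&devBuf[s], bytes);
        for (int i = 0; i < n; ++i) hostBuf[s][i] = 1.0f;
    }

    // Depth first: all the work for stream 0, then all the work for stream 1, ...
    for (int s = 0; s < kStreams; ++s) {
        cudaMemcpyAsync(devBuf[s], hostBuf[s], bytes, cudaMemcpyHostToDevice, streams[s]);
        smallKernel<<<blocks, threads, 0, streams[s]>>>(devBuf[s], n);
        cudaMemcpyAsync(hostBuf[s], devBuf[s], bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    // Breadth first: all uploads, then all kernels, then all downloads.
    for (int s = 0; s < kStreams; ++s)
        cudaMemcpyAsync(devBuf[s], hostBuf[s], bytes, cudaMemcpyHostToDevice, streams[s]);
    for (int s = 0; s < kStreams; ++s)
        smallKernel<<<blocks, threads, 0, streams[s]>>>(devBuf[s], n);
    for (int s = 0; s < kStreams; ++s)
        cudaMemcpyAsync(hostBuf[s], devBuf[s], bytes, cudaMemcpyDeviceToHost, streams[s]);
    cudaError_t err = cudaDeviceSynchronize();

    printf("status: %s, first value: %f\n", cudaGetErrorString(err), hostBuf[0][0]);

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(devBuf[s]);
        cudaFreeHost(hostBuf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

Timing each variant (for example with events or a profiler timeline) is the easiest way to see which behaviour a given device shows.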

I’m posting to see if there is any way around this (even if it’s a bit hacky). I was also wondering whether using mapped memory could solve my problem. Are mapped-memory device-to-host transfers queued in the same (blocking) way that non-mapped transfers are? I was hoping that, since they aren’t explicitly inserted in the stream, this might let me get around the whole ‘blocked until all kernels are done executing’ issue.

Any comments or insights are appreciated!

The only way around this is to wait for GK110, which has multiple independent work queues. The single queue on Fermi is frustratingly limiting even when all you are doing is launching kernels! Watch the GTC talk “New features in the CUDA programming model” for details on this.

I have had some success in using mapped memory to reduce latency. In my application, there are a few kernels that determine yes/no answers that I need back on the host. Depending on the answer, a different code path is taken. Using host mapped memory and a cudaEventSynchronize shaved a couple microseconds off compared to a cudaMemcpy to copy back a single int.
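
Here is a minimal sketch of that pattern, in case it helps. The `decide` kernel, the threshold, and the buffer size are invented for illustration, and I am assuming a device that reports `canMapHostMemory`. The kernel writes its yes/no answer straight into mapped host memory, and the host waits on an event instead of issuing a `cudaMemcpy`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical decision kernel: sets *flag to 1 if any element exceeds threshold.
// A single thread does the scan to keep the example short.
__global__ void decide(const float *data, int n, float threshold, int *flag) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int any = 0;
        for (int i = 0; i < n; ++i)
            if (data[i] > threshold) any = 1;
        *flag = any;  // written directly into mapped host memory
    }
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);  // enable mapping before the context is created

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) {
        printf("device cannot map host memory\n");
        return 1;
    }

    const int n = 1024;  // assumed input size
    float *dData;
    cudaMalloc((void **)&dData, n * sizeof(float));
    cudaMemset(dData, 0, n * sizeof(float));

    int *hFlag, *dFlag;
    cudaHostAlloc((void **)&hFlag, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dFlag, hFlag, 0);
    *hFlag = 0;

    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    decide<<<1, 32>>>(dData, n, 0.5f, dFlag);
    cudaEventRecord(done, 0);
    cudaEventSynchronize(done);  // after this, the host can read the flag

    if (*hFlag)
        printf("taking path A\n");
    else
        printf("taking path B\n");

    cudaEventDestroy(done);
    cudaFreeHost(hFlag);
    cudaFree(dData);
    return 0;
}
```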

Why do you think you can reduce latency by issuing multiple kernel calls? You pay the launch latency for each kernel invocation, so splitting your work into multiple calls will only add more latency to your critical path. It would be better just to launch everything in one big kernel. The latency is only ~10 microseconds per launch.

Thanks for the reply,

I was at the GTC 2012 conference and got pretty excited about the Hyper-Q feature when they announced it. Unfortunately I don’t think I’ll be able to get my hands on a GK110 unless they put it in something consumer grade in addition to the Tesla K20. Can anyone from NVIDIA comment on the likelihood that Hyper-Q will end up in consumer-grade hardware?

I realize that issuing multiple kernel calls will reduce my throughput, but I believe I can use this technique to reduce the latency between some data showing up and getting a result. I’m trying to see if I can carry out some operations in real time. If I issue one large kernel, I might have to wait up to a second for it to finish processing. If I issue a bunch of smaller ones, it will take a little longer overall, but maybe I’ll only have to wait less than 100ms from data arriving in my system to having a useful result.
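
To make the idea concrete, here is a rough sketch of what I mean by chunking. Everything in it is a placeholder (`processChunk`, the chunk count, the sizes), and it assumes all the input is already on the device, which is not true of my real application. Each chunk gets its own kernel launch, copy-back, and event, so the host can consume the first results without waiting for the whole job.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-chunk work: square each element.
__global__ void processChunk(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

int main() {
    const int kChunks = 8;           // assumed number of chunks
    const int chunkN = 1 << 14;      // assumed elements per chunk
    const int total = kChunks * chunkN;
    const size_t chunkBytes = chunkN * sizeof(float);
    const int threads = 256, blocks = (chunkN + threads - 1) / threads;

    float *hIn, *hOut, *dIn, *dOut;
    cudaMallocHost((void **)&hIn, total * sizeof(float));   // pinned for async copies
    cudaMallocHost((void **)&hOut, total * sizeof(float));
    cudaMalloc((void **)&dIn, total * sizeof(float));
    cudaMalloc((void **)&dOut, total * sizeof(float));
    for (int i = 0; i < total; ++i) hIn[i] = (float)(i % 100);
    cudaMemcpy(dIn, hIn, total * sizeof(float), cudaMemcpyHostToDevice);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t ready[kChunks];

    // Queue every chunk up front: kernel, copy-back, then an event marking
    // the point at which that chunk's results are on the host.
    for (int c = 0; c < kChunks; ++c) {
        const int offset = c * chunkN;
        cudaEventCreateWithFlags(&ready[c], cudaEventDisableTiming);
        processChunk<<<blocks, threads, 0, stream>>>(dIn + offset, dOut + offset, chunkN);
        cudaMemcpyAsync(hOut + offset, dOut + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(ready[c], stream);
    }

    // Consume each chunk as soon as it is ready, while later chunks still run.
    for (int c = 0; c < kChunks; ++c) {
        cudaEventSynchronize(ready[c]);
        printf("chunk %d ready, first result %f\n", c, hOut[c * chunkN]);
        cudaEventDestroy(ready[c]);
    }

    cudaStreamDestroy(stream);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return 0;
}
```

With a single stream the chunks run strictly one after another, so this only shows the latency side of the trade-off. Keeping the device fully busy with small kernels is exactly the part I am unsure about on Fermi.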