Application speedup using multistream

Is there an OpenCL example using multiple command queues?

I’ve been trying to build an application with multiple command queues for the past few days and haven’t figured out how to do it. The goal is simply to run memory transfers while a kernel is executing, so that the transfer time is hidden.

Here is what I did:

  • I create two command queues in the same context.
  • I enqueue a kernel execution into the first and a non-blocking (device to host) memory transfer into the second.
  • Using OpenCL visual profiler, I clearly see my 2 streams, but they don’t run in parallel.

Has anyone succeeded in hiding the transfer time? Is there an optional parameter that needs to be activated? Do I need to enqueue the kernel command and the transfer command from two different host threads?

I’d appreciate some help…