Overlapping Transfer and Device Computation

I’m optimizing my code by overlapping data transfers with device computation, as explained in the “OpenCL Best Practices Guide”.
I currently don’t have a device with compute capability 2.0, so I cannot test overlapping with two independent data transfers.
However, in the near future I will be able to work with Fermi-based GPUs, and I would like to have the code ready.

The example in the guide only shows how to work with two queues. I was wondering whether one simply needs three queues to do an H2D copy, a D2H copy and a computation simultaneously, or whether there are other factors that one should take care of.
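
To make the schedule I am hoping for concrete, here is a minimal timing model in plain C. It makes no OpenCL calls. It assumes one in-order queue per stage (H2D copy, kernel, D2H copy), with each command waiting on the previous stage of the same chunk, as it would with an event. It also assumes the device really can run all three stages at the same time, which is exactly what I cannot verify yet. The function names and the per-chunk durations are hypothetical and only serve the illustration.

```c
#include <assert.h>

#define NUM_STAGES 3  /* 0: H2D copy, 1: kernel, 2: D2H copy */

static double max2(double a, double b) { return a > b ? a : b; }

/* Total time when all commands of all chunks run back-to-back
 * in a single in-order queue (no overlap at all). */
double serial_time(int chunks, const double dur[NUM_STAGES]) {
    assert(chunks >= 0);
    return chunks * (dur[0] + dur[1] + dur[2]);
}

/* Total time with one in-order queue per stage (assumed setup).
 * A command starts once (a) its own queue is free and
 * (b) the previous stage of the same chunk has finished. */
double pipelined_time(int chunks, const double dur[NUM_STAGES]) {
    double queue_free[NUM_STAGES] = {0.0, 0.0, 0.0};
    assert(chunks >= 0);
    for (int c = 0; c < chunks; ++c) {
        double prev_stage_done = 0.0;  /* first stage has no dependency */
        for (int s = 0; s < NUM_STAGES; ++s) {
            double start = max2(queue_free[s], prev_stage_done);
            queue_free[s] = start + dur[s];
            prev_stage_done = queue_free[s];
        }
    }
    return queue_free[NUM_STAGES - 1];  /* when the last D2H copy ends */
}
```

With 4 chunks and made-up durations of 2, 3 and 2 time units per chunk, `serial_time` gives 28 and `pipelined_time` gives 16. Whether real hardware gets close to the second number presumably depends on more than the number of queues, which is the part I am asking about.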