I wouldn’t worry about how many copy engines there are.
The maximum concurrency you can get is all of the following happening at the same time:
- data transfer from host to device
- one or more kernels executing on the device
- data transfer from device to host
- CPU code activity
You can achieve that level of concurrency without thinking too hard about copy engines.
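As a rough illustration of how you might get that overlap, here is a minimal sketch that splits the work into chunks and issues each chunk's copy-in, kernel, and copy-out into its own stream. The kernel `scale`, the function `run_overlapped`, and the chunking scheme are all my own placeholders, not anything from your code. It assumes pinned host memory (async copies generally need it to overlap) and a device that can overlap copies with kernels.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel for illustration: scales each element in place.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Processes n floats in nStreams chunks, one stream per chunk.
// Returns true if every element came back with the expected value.
// Assumes 1 <= nStreams <= 16 and n divisible by nStreams.
bool run_overlapped(int n, int nStreams)
{
    if (nStreams < 1 || nStreams > 16 || n <= 0 || n % nStreams != 0) return false;
    const int chunk = n / nStreams;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h = nullptr, *d = nullptr;
    // Pinned host memory, so the async copies can actually overlap.
    if (cudaMallocHost(&h, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) { cudaFreeHost(h); return false; }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t streams[16];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Within a stream the three operations run in order; across streams
    // they are free to overlap (H2D of one chunk, kernel of another,
    // D2H of a third).
    for (int s = 0; s < nStreams; ++s) {
        const int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunkBytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, chunkBytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    // All of the calls above return immediately, so the CPU can do
    // unrelated work here while the GPU is busy.
    long long cpuWork = 0;
    for (int i = 1; i <= 1000000; ++i) cpuWork += i;

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 2.0f);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok && cpuWork > 0;
}
```

You would call something like `run_overlapped(1 << 20, 4)` from `main` and look at the timeline in a profiler to see whether the copies and kernels actually overlap. Notice that nothing in there asks how many copy engines the device has.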
If you want multiple transfers to run concurrently in the same direction, there is no point in trying. In my experience, it is not possible to observe that. It should make no difference anyway, because all transfers in a given direction share a pipe with fixed bandwidth. Even if you could run more transfers at once, there would be no benefit compared to running them serially, which is typically what I observe.
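If you want to check this on your own system, a sketch like the following times two host-to-device copies issued into one stream against the same two copies issued into two different streams. The names `CopyTimes` and `time_same_direction` are made up for this example, and host wall-clock timing is crude, so treat the numbers as a rough comparison only. I am not quoting any expected result here; run it and see what your hardware does.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Hypothetical result type for this sketch.
struct CopyTimes { double oneStreamMs; double twoStreamMs; bool ok; };

static double now_ms()
{
    using namespace std::chrono;
    return duration<double, std::milli>(steady_clock::now().time_since_epoch()).count();
}

// Times two H2D copies of `bytes` each: first both in one stream,
// then one copy in each of two streams.
CopyTimes time_same_direction(size_t bytes)
{
    CopyTimes t = {0.0, 0.0, false};
    char *h0 = nullptr, *h1 = nullptr, *d0 = nullptr, *d1 = nullptr;
    cudaStream_t s0, s1;
    if (cudaStreamCreate(&s0) != cudaSuccess) return t;
    if (cudaStreamCreate(&s1) != cudaSuccess) { cudaStreamDestroy(s0); return t; }

    bool ok = cudaMallocHost(&h0, bytes) == cudaSuccess;   // pinned buffers
    ok = ok && cudaMallocHost(&h1, bytes) == cudaSuccess;
    ok = ok && cudaMalloc(&d0, bytes) == cudaSuccess;
    ok = ok && cudaMalloc(&d1, bytes) == cudaSuccess;

    if (ok) {
        // Warm-up so one-time setup costs do not land in the first measurement.
        cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, s0);
        cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s1);
        cudaDeviceSynchronize();

        // Case 1: both copies in the same stream, so they must run serially.
        double t0 = now_ms();
        cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, s0);
        cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s0);
        cudaDeviceSynchronize();
        double t1 = now_ms();

        // Case 2: one copy per stream, so they are allowed to run concurrently.
        cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, s0);
        cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s1);
        cudaDeviceSynchronize();
        double t2 = now_ms();

        t.oneStreamMs = t1 - t0;
        t.twoStreamMs = t2 - t1;
        t.ok = (cudaGetLastError() == cudaSuccess);
    }

    if (d1) cudaFree(d1);
    if (d0) cudaFree(d0);
    if (h1) cudaFreeHost(h1);
    if (h0) cudaFreeHost(h0);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s0);
    return t;
}
```

Print `oneStreamMs` and `twoStreamMs` for a transfer size of a few tens of megabytes and compare them. A profiler timeline is a better tool for seeing whether the two copies in the second case actually overlapped.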
And as njuffa indicated, copy engines have nothing to do with transferring data between global and shared memory.
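To make that last point concrete, here is a small sketch of what a global-to-shared transfer looks like. It is just ordinary loads and stores executed by the threads of a kernel; no `cudaMemcpy` call and no copy engine is involved. The kernel `reverse_tiles`, the helper `reverse_tiles_ok`, and the tile size are placeholders I picked for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 256

// Each block stages TILE elements in shared memory, then writes them
// back out in reverse order within the tile.
__global__ void reverse_tiles(const int *in, int *out)
{
    __shared__ int tile[TILE];
    int g = blockIdx.x * TILE + threadIdx.x;

    // Global -> shared: a plain load and store issued by this thread.
    tile[threadIdx.x] = in[g];
    __syncthreads();   // make the whole tile visible to the block

    // Shared -> global: again ordinary instructions inside the kernel.
    out[g] = tile[TILE - 1 - threadIdx.x];
}

// Runs the kernel on `blocks` tiles and checks the result on the host.
bool reverse_tiles_ok(int blocks)
{
    if (blocks <= 0) return false;
    const int n = blocks * TILE;
    std::vector<int> hIn(n), hOut(n, -1);
    for (int i = 0; i < n; ++i) hIn[i] = i;

    int *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, n * sizeof(int)) != cudaSuccess) { cudaFree(dIn); return false; }

    // These host<->device copies are the only part of this example
    // that a copy engine could be involved in.
    cudaMemcpy(dIn, hIn.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    reverse_tiles<<<blocks, TILE>>>(dIn, dOut);
    bool ok = cudaMemcpy(hOut.data(), dOut, n * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;

    for (int b = 0; ok && b < blocks; ++b)
        for (int t = 0; ok && t < TILE; ++t)
            ok = (hOut[b * TILE + t] == hIn[b * TILE + (TILE - 1 - t)]);

    cudaFree(dOut);
    cudaFree(dIn);
    return ok;
}
```

Calling `reverse_tiles_ok(8)` from `main` is enough to try it out.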