And if your GPU doesn’t have two copy engines, you can still overlap one copy with a kernel that uses mapped (zero-copy) memory to perform the transfer in the opposite direction, achieving about the same throughput.
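A minimal sketch of that idea (function and variable names are made up for illustration, and error checking is omitted): the device-to-host copy goes through the single copy engine via cudaMemcpyAsync, while a kernel in a second stream pulls the host-to-device data itself through mapped pinned memory.

```cuda
// Hypothetical sketch: overlap a D2H cudaMemcpyAsync with a kernel
// that reads its input through mapped (zero-copy) host memory.
__global__ void consume_mapped(const float *src_mapped, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src_mapped[i];   // each read travels over PCIe
}

void overlap_without_second_copy_engine(const float *d_out, float *h_out,
                                        float *h_in_pinned, float *d_in,
                                        int n, cudaStream_t s1, cudaStream_t s2)
{
    float *d_in_mapped;
    // h_in_pinned must have been allocated with cudaHostAlloc(..., cudaHostAllocMapped)
    cudaHostGetDevicePointer((void **)&d_in_mapped, h_in_pinned, 0);

    // The D2H copy occupies the single copy engine...
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float),
                    cudaMemcpyDeviceToHost, s1);
    // ...while the kernel performs the H2D transfer itself.
    consume_mapped<<<(n + 255) / 256, 256, 0, s2>>>(d_in_mapped, d_in, n);
}
```

Since both directions proceed at the same time, this gets close to the throughput of a dual-copy-engine setup, at the cost of occupying some SM resources for the transfer kernel.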
The answer to your first question is a lot more difficult, because it depends on what you consider “running at the same time”. For the practical purpose of understanding your kernel’s performance, however, we can mostly ignore these more esoteric questions. The most important factor affecting the kernel’s performance is the achievable throughput of each instruction, which is listed in the “Maximize Instruction Throughput” section of the Programming Guide.
If throughput considerations alone don’t explain your findings, the next important concept is latency. Each instruction needs a certain amount of time to perform its operation. Independent instructions that follow in the instruction stream can be issued before the results of the previous instructions are available. However, if an instruction that depends on one of those results comes up, execution has to wait for the result to become available.
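To illustrate the difference (hypothetical kernels, just for demonstration): in the first loop every multiply-add depends on the previous one, so each iteration pays the full arithmetic latency; the second keeps four independent chains in flight, so the hardware can overlap them.

```cuda
// Dependent chain: each FMA must wait for the previous result.
__global__ void dependent_chain(float *out, float a, float b)
{
    float x = a;
    for (int i = 0; i < 256; ++i)
        x = x * b + a;          // serialized by the data dependency
    out[threadIdx.x] = x;
}

// Four independent chains: the scheduler can issue the next FMA
// while earlier ones are still completing, hiding most of the latency.
__global__ void independent_chains(float *out, float a, float b)
{
    float x0 = a, x1 = a, x2 = a, x3 = a;
    for (int i = 0; i < 64; ++i) {
        x0 = x0 * b + a;
        x1 = x1 * b + a;
        x2 = x2 * b + a;
        x3 = x3 * b + a;
    }
    out[threadIdx.x] = x0 + x1 + x2 + x3;
}
```

The same latency can of course also be hidden by running more warps, but instruction-level parallelism within a single thread works even at low occupancy.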
Latency for instructions operating entirely in registers is about 22 cycles (less on Kepler). Latency for shared memory accesses is a bit over 30 cycles, and an L1 cache hit presumably is about the same. Latency for global memory access is somewhere between 400 and more than 1000 cycles, depending on memory bus load and a couple of other factors. Unfortunately these numbers are not published by Nvidia.
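Since the numbers aren’t published, they have to be measured. One common trick (sketched below with made-up constants, not an official benchmark) is to time a long dependent chain with the on-chip clock64() counter and divide by the number of operations:

```cuda
// Hypothetical microbenchmark sketch: each iteration of the loop
// depends on the previous one, so (stop - start) / 1024 approximates
// the latency of a single dependent FMA in clock cycles.
__global__ void measure_fma_latency(float *data, long long *cycles)
{
    float x = data[0];
    long long start = clock64();
    #pragma unroll
    for (int i = 0; i < 1024; ++i)
        x = x * 1.000001f + 0.5f;    // dependent chain
    long long stop = clock64();
    data[0] = x;                     // keep the compiler from deleting the loop
    *cycles = (stop - start) / 1024;
}
```

Run it with a single warp so no other warps hide the latency, and substitute shared or global memory loads into the chain to measure those latencies instead.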