I have a high-performance computing application, but I do not need multiple “virtual GPUs”. Performance-wise, a P2000 has everything I need, except that I’m worried it may not support simultaneous execution of a DMA uplink from host to device memory, a DMA downlink from device to host, and kernel execution. Although my computational requirement is comfortably below 1 TOPS, I need this capability because of my application’s memory bandwidth requirement.

Does the P2000 support this? Its data sheet, which lists only a few top-level specs, does not say. The CUDA C Programming Guide seems to imply (in its discussion of asynchronous concurrent execution) that the answer depends on a device property called asyncEngineCount, but I could not find anything on the NVIDIA website stating what asyncEngineCount is for the P2000. I’m sure the P40 will do what I want, but it is massive overkill computationally.

Please advise: can I do concurrent DMA up, DMA down, and computation with the P2000?
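In case it helps anyone with a P2000 on hand answer this directly, here is a small sketch of how I understand the property can be read at runtime with `cudaGetDeviceProperties`; as I read the docs, `asyncEngineCount == 2` would mean both copy directions can overlap with kernel execution, and `1` would mean only one copy direction at a time can overlap:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA devices found.\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // asyncEngineCount: number of asynchronous copy engines.
        // 2 should permit concurrent H2D copy, D2H copy, and kernels.
        printf("Device %d (%s): asyncEngineCount = %d\n",
               i, prop.name, prop.asyncEngineCount);
    }
    return 0;
}
```

Note that actually achieving the overlap also requires pinned (page-locked) host memory and `cudaMemcpyAsync` issued on separate non-default streams, if I understand the programming guide correctly.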