Concurrent copy and kernel execution: Yes with 6 copy engine(s)---How to make full use of it?

Hi! I ran deviceQuery.exe and found that my GeForce GTX 1650 reports: Concurrent copy and kernel execution: Yes with 6 copy engine(s)
I am wondering how to make full use of this. I have three related cases:

  1. I am trying to use the “ping-pong” technique, which is:
load data to location A
kernel execution1
load data to location B
kernel execution2

Because of this property of the GPU, kernel execution 1 will overlap with the loading of data B, right? (I think this is true.)

Now, with 6 engines, can I somehow make this more efficient? (Maybe, if the kernel execution time equals the time of 6 copy steps, I can run 6 memory copies concurrently with 1 kernel execution?)

memory copy 1
memory copy 2
memory copy 3
memory copy 4
memory copy 5
memory copy 6
kernel execution
  2. Another thing: I am doing SGEMM, and initially I load tiles of A and B into sharedA and sharedB. Because I have 6 engines, will these two steps actually be executed at the same time? (And maybe I can do 6 copy steps at the same time?)
sharedA[....] = A[...?...]
sharedB[...] = B[...?.....]
  3. Also, can a shared-memory-to-register copy and a global-memory-to-shared-memory copy be executed at the same time, because I have multiple copy engines? (I wonder this because I have heard kernel execution can run concurrently with memory copies since they use different hardware units; but G->S and S->R are both memory operations, so can they be concurrent?)

But maybe this effort is not worthwhile… I just searched and found that the A100 has only 2 copy engines. Maybe having 6 copy engines is unusual? And is this like registers, where using too many limits the number of active SMs?
Thank you!!!

Is this GPU reported as GeForce GTX 1650 with Max-Q Design by any chance?


I think…maybe not…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce GTX 1650"
  CUDA Driver Version / Runtime Version          11.6 / 11.3
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 4096 MBytes (4294639616 bytes)
  (016) Multiprocessors, (064) CUDA Cores/MP:    1024 CUDA Cores
  GPU Max Clock rate:                            1560 MHz (1.56 GHz)
  Memory Clock rate:                             4001 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes  =49.152KB
  Total shared memory per multiprocessor:        65536 bytes  =65.536KB
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 6 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.6, CUDA Runtime Version = 11.3, NumDevs = 1
Result = PASS

Weird, the GTX 1650 is reported with 2 copy engines in screenshots I find on the internet. Fairly recently we had a question in these forums why the RTX 3080 is reported with 1 copy engine, which seems incorrect.

This makes me wonder whether (1) NVIDIA’s definition of a copy engine has changed, or (2) there is a software bug causing the number of copy engines to be reported incorrectly.

I can see how more than 2 copy engines being present is useful in GPUs that support NVlink in addition to PCIe, but six copy engines on a low-end consumer-grade GPU makes zero sense to me.


Thank you very much!!! Actually, I am not trying to copy memory from host to device, but from global memory to shared memory/registers. I am not even using streams! Just like the ping-pong pattern: we write the code normally, but the GPU overlaps those instructions automatically. (Strange, but judging from how the code runs, it seems to work.) (Just like this example: SGEMM · NervanaSystems/maxas Wiki · GitHub.) And from the documentation, I think this is possible (…?):

It is also possible to perform an intra-device copy simultaneously with kernel execution (on devices that support the concurrentKernels device property) and/or with copies to or from the device (for devices that support the asyncEngineCount property). Intra-device copies are initiated using the standard memory copy functions with destination and source addresses residing on the same device.

(3.2.6.3)

Checking right now, my Quadro RTX 4000 reports 6 copy engines, where in the past it would definitely report 3. I have no idea what is going on.

Copy engine is really a marketing term for a DMA mechanism. And unless something has changed in recent years: (1) these are for DMA transfers across external interfaces (PCIe, NVlink); (2) copy APIs that transfer data within global memory on the same device use pre-defined internal kernels, not DMA engines.

None of this has anything to do with transferring data from global memory to shared memory or registers. That is done by regular program code of your kernel using load and store instructions.

Since PCIe is a full-duplex interconnect, any GPU with 2 or more copy engines can perform simultaneous host-to-device and device-to-host copies. I am reasonably sure that NVIDIA provides a sample app that demonstrates this. One copy engine is needed for each transfer direction, and two copy engines can saturate the available PCIe bandwidth. So off-hand I don’t see how having more copy engines would help, unless there are other interfaces (such as NVlink) that also need to be serviced concurrently.

Maybe someone from NVIDIA can provide an authoritative description about the state of copy engines in GPUs at this time.


Thank you very much!!! Just two last questions; maybe you could answer them with just yes or no, to save your time~
  2. I am doing SGEMM, and initially I load tiles of A and B into sharedA and sharedB. Will these two loads be executed sequentially?

sharedA[....] = A[...?...]
sharedB[...] = B[...?.....]
  3. Also, can the global-to-shared copy and the shared-to-register copy be executed at the same time? That is, can G->S and S->R be concurrent?

Thank you very much!!

I wouldn’t worry about how many copy engines there are.

The maximum concurrency is the following:

  1. data transfer from host to device
  2. one or more kernels executing on the device
  3. data transfer from device to host
  4. CPU code activity

You can achieve that level of concurrency without thinking too hard about copy engines.

There is no point trying to achieve multiple concurrent transfers in the same direction. In my experience it is not possible to observe that, and it would make no difference anyway, because you are using a pipe with fixed bandwidth. Even if you could run more transfers concurrently, there would be no benefit compared to running them serially, which is typically what I observe.

And as njuffa indicated, copy engines have nothing to do with transferring data between global and shared memory.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.