Hi! I ran deviceQuery.exe and found that my GeForce GTX 1650 reports: Concurrent copy and kernel execution: Yes with 6 copy engine(s)
I am wondering how to make full use of them. I have three related cases:
I am trying to use the “ping-pong” technique, which is:
load data to location A
kernel execution1
load data to location B
kernel execution2
And because copy and kernel execution are concurrent on the GPU, kernel execution 1 will overlap with loading data to B!?!? (I think this is true; a sketch of what I mean is below.)
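Roughly like this (a sketch of what I mean; myKernel, N, numChunks, and the buffer names are all placeholders I made up, and the host buffer must be pinned for the async copies to overlap):

```cpp
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {    // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20, numChunks = 8;
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    float *hIn;                                   // pinned, so copies can be async
    cudaMallocHost(&hIn, (size_t)numChunks * N * sizeof(float));
    float *d[2];                                  // device buffers A (d[0]) and B (d[1])
    cudaMalloc(&d[0], N * sizeof(float));
    cudaMalloc(&d[1], N * sizeof(float));

    for (int c = 0; c < numChunks; ++c) {
        int buf = c % 2;                          // ping-pong between A and B
        // the copy in one stream can overlap the kernel still running in the other
        cudaMemcpyAsync(d[buf], hIn + (size_t)c * N, N * sizeof(float),
                        cudaMemcpyHostToDevice, s[buf]);
        myKernel<<<(N + 255) / 256, 256, 0, s[buf]>>>(d[buf], N);
    }
    cudaDeviceSynchronize();
    return 0;
}
```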
Now, with 6 engines, can I somehow make this more efficient? (For example, if one kernel execution takes as long as 6 copy steps, could I run 6 memory copies concurrently with 1 kernel execution!?!?)
Another thing is, I am doing SGEMM: initially I load tiles of A and B into sharedA and sharedB. Because I have 6 engines, will these two steps actually execute at the same time? (And maybe I can do 6 copy steps at the same time!?!?)
Also, can the shared-memory-to-register copy and the global-memory-to-shared-memory copy be executed at the same time, because I have multiple copy engines? (I wonder this because I have heard kernel execution can run concurrently with memory copies because they are handled by different hardware units, but S->R and G->S are both memory operations; can they be concurrent…?)
(But maybe this effort is not worthwhile… I just searched for the A100 and found it has only 2 copy engines… Maybe having 6 copy engines is unusual? And are copy engines like registers, where using too many limits the number of active SMs?)
Thank you!!!
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce GTX 1650"
CUDA Driver Version / Runtime Version 11.6 / 11.3
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 4096 MBytes (4294639616 bytes)
(016) Multiprocessors, (064) CUDA Cores/MP: 1024 CUDA Cores
GPU Max Clock rate: 1560 MHz (1.56 GHz)
Memory Clock rate: 4001 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 1048576 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes (48 KB)
Total shared memory per multiprocessor: 65536 bytes (64 KB)
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 6 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.6, CUDA Runtime Version = 11.3, NumDevs = 1
Result = PASS
Weird, the GTX 1650 is reported with 2 copy engines in screenshots I found on the internet. Fairly recently we had a question in these forums asking why the RTX 3080 is reported with 1 copy engine, which seems incorrect.
This makes me wonder whether (1) NVIDIA’s definition of a copy engine has changed, or (2) there is a software bug causing the number of copy engines to be reported incorrectly.
I can see how having more than 2 copy engines is useful in GPUs that support NVLink in addition to PCIe, but six copy engines on a low-end consumer-grade GPU makes zero sense to me.
Thank you very much!!! Actually, I am not trying to copy memory from host to device, but from global memory to shared memory/registers. I am not even using streams! As in the ping-pong case, we write the code normally, and the GPU overlaps those instructions automatically. (Strange, but judging from how the code runs, that is what seems to happen.) (Just like this example: SGEMM · NervanaSystems/maxas Wiki · GitHub.) And from the documentation, I think this is possible(…?):
It is also possible to perform an intra-device copy simultaneously with kernel execution (on devices that support the concurrentKernels device property) and/or with copies to or from the device (for devices that support the asyncEngineCount property). Intra-device copies are initiated using the standard memory copy functions with destination and source addresses residing on the same device.
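So I think something like the following should be allowed (my own sketch; busyKernel and the buffer names are placeholders): a device-to-device cudaMemcpyAsync issued in one stream while a kernel runs in another.

```cpp
#include <cuda_runtime.h>

__global__ void busyKernel(float *w, int n) {         // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) w[i] = w[i] * w[i] + 1.0f;
}

int main() {
    const int n = 1 << 22;
    float *dSrc, *dDst, *dWork;
    cudaMalloc(&dSrc,  n * sizeof(float));
    cudaMalloc(&dDst,  n * sizeof(float));
    cudaMalloc(&dWork, n * sizeof(float));

    cudaStream_t copyStream, kernelStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&kernelStream);

    busyKernel<<<(n + 255) / 256, 256, 0, kernelStream>>>(dWork, n);
    // intra-device (D2D) copy issued in a different stream, so it can
    // overlap the kernel, as the quoted passage describes
    cudaMemcpyAsync(dDst, dSrc, n * sizeof(float),
                    cudaMemcpyDeviceToDevice, copyStream);
    cudaDeviceSynchronize();
    return 0;
}
```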
Checking right now, my Quadro RTX 4000 reports 6 copy engines, where in the past it would definitely report 3. I have no idea what is going on.
“Copy engine” is really a marketing term for a DMA mechanism. And unless something has changed in recent years: (1) these are for DMA transfers across external interfaces (PCIe, NVLink); (2) copy APIs that transfer data within global memory on the same device use pre-defined internal kernels, not DMA engines.
None of this has anything to do with transferring data from global memory to shared memory or registers. That is done by your kernel’s regular program code, using load and store instructions.
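To illustrate: inside a tiled kernel this is just ordinary code along these lines (a generic sketch, not any particular SGEMM; launch as, e.g., tileLoad<<<1, dim3(32, 32)>>>(...)):

```cpp
__global__ void tileLoad(const float *gA, float *gOut, int lda) {
    __shared__ float sA[32][32];
    int row = threadIdx.y, col = threadIdx.x;
    // global -> shared: an ordinary load from global memory followed by
    // an ordinary store to shared memory, executed by each thread
    sA[row][col] = gA[row * lda + col];
    __syncthreads();
    // shared -> register: again just a regular load instruction
    float r = sA[col][row];
    gOut[row * lda + col] = r;    // use the value so it isn't optimized away
}
```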
Since PCIe is a full-duplex interconnect, any GPU with 2 or more copy engines can achieve simultaneous host->device and device->host copies. I am reasonably sure that NVIDIA provides a sample app that demonstrates this. One copy engine is needed for each transfer direction, and two copy engines can saturate the available PCIe bandwidth. So offhand I don’t see how having more copy engines available would help, unless there are other interfaces (such as NVLink) that also need to be serviced concurrently.
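A minimal sketch of the duplex case (my own illustration with made-up buffer names, not the NVIDIA sample): issue the two transfers in different streams using pinned host buffers, and a profiler timeline should show them overlapping.

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256u << 20;              // 256 MB each way
    float *hUp, *hDown, *dUp, *dDown;
    cudaMallocHost(&hUp, bytes);                  // pinned host buffers
    cudaMallocHost(&hDown, bytes);
    cudaMalloc(&dUp, bytes);
    cudaMalloc(&dDown, bytes);

    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    // one copy engine services each direction; with 2+ engines these
    // transfers can proceed simultaneously over the full-duplex PCIe link
    cudaMemcpyAsync(dUp, hUp, bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(hDown, dDown, bytes, cudaMemcpyDeviceToHost, down);
    cudaDeviceSynchronize();
    return 0;
}
```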
Maybe someone from NVIDIA can provide an authoritative description about the state of copy engines in GPUs at this time.
Thank you very much!!! Just two last questions; maybe you could answer them with just yes or no, to save your time~
1. I am doing SGEMM: initially I load tiles of A and B into sharedA and sharedB. Will these two loads be executed sequentially???
2. Can the shared-memory-to-register copies and the global-memory-to-shared-memory copies be executed at the same time, i.e., can G->S and S->R run concurrently?
I wouldn’t worry about how many copy engines there are.
The maximum concurrency is the following:
data transfer from host to device
one or more kernels executing on the device
data transfer from device to host
CPU code activity
You can achieve that level of concurrency without thinking too hard about copy engines.
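For example, a pattern like this can exhibit all four at once (a sketch with made-up names; the host buffers must be pinned for the copies to be asynchronous):

```cpp
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {                 // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void doCpuWork() { /* placeholder for concurrent host-side work */ }

int main() {
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);
    float *hIn, *hOut, *dIn, *dOut, *dPrev;
    cudaMallocHost(&hIn, bytes);                        // pinned host memory
    cudaMallocHost(&hOut, bytes);
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMalloc(&dPrev, bytes);

    cudaStream_t s1, s2, s3;
    cudaStreamCreate(&s1); cudaStreamCreate(&s2); cudaStreamCreate(&s3);

    cudaMemcpyAsync(dIn, hIn, bytes, cudaMemcpyHostToDevice, s1);   // H2D
    work<<<(n + 255) / 256, 256, 0, s2>>>(dPrev, n);                // kernel
    cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, s3); // D2H
    doCpuWork();                                        // CPU runs concurrently
    cudaDeviceSynchronize();
    return 0;
}
```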
There is no point trying to run multiple transfers concurrently in the same direction. In my experience it is not possible to observe that, and it should make no difference anyway, because you are using a pipe with fixed bandwidth: even if you could run more transfers at once, there would be no benefit compared to running them serially, which is what I typically observe.
And as njuffa indicated, copy engines have nothing to do with transferring data between global and shared memory.