cudaMemcpyAsync CPU load?

D2D_CUDA8::D2D_CUDA8(int nGPUIdx)
{
	err = cudaSetDeviceFlags(cudaDeviceBlockingSync);
	err = cudaSetDeviceFlags(cudaDeviceScheduleYield);
	cudaSetDevice(nGPUIdx);
}
void D2D_CUDA8::RDMASetMemory(SIZE_T size)
{
    err = cudaMallocHost((void**)&RDMA_Memory, size);
}
void D2D_CUDA8::CopyMemoryToDevice(int* Source, SIZE_T W, SIZE_T H)
{
    for (SIZE_T i = 0; i < H; i++)
    {
        err = cudaMemcpyAsync(RDMA_Memory + i * W, Source + i * W, W * sizeof(int), cudaMemcpyHostToDevice);
    }
}
void D2D_CUDA8::CopyMemoryToHost(int* Source, SIZE_T size)
{
	err = cudaMemcpyAsync(Source, RDMA_Memory, size, cudaMemcpyDeviceToHost);
}

int main()
{
    cuda = new D2D_CUDA8(1);
    for (int i = 0; i < 100; i++)
    {
        unsafe
        {
            cuda.CopyMemoryToDevice((int*)m_mainMemory.GetPtr().ToPointer(), (ulong)(40000 * 40000), (ulong)m_mainMemory.H);
            cuda.CopyMemoryToHost((int*)m_subMemory.GetPtr().ToPointer(), (ulong)(40000 * 40000));
        }
    }
}

I wrote code that copies a 40000*40000 image from mainMemory to the device, and then from the device back to subMemory. The copy itself works correctly.
(To visually check the load rate in Task Manager, the copy is repeated about 100 times.)
I assumed this copying was done through DMA.
However, this code constantly generates a CPU load of roughly 10 to 20%.
I'm doing nothing other than copying memory, so is this copy performed by the GPU through DMA, but with help from the CPU?
Can't I do this copy on the GPU alone, without CPU load?
My program is already under a lot of CPU load, so I don't want the cudaMemcpy process to add more.

It’s curious to me that you call both of those. The scheduling flags are mutually exclusive, so I think the second cudaSetDeviceFlags call contradicts the first.

Note that when you call cudaMemcpyAsync into the null stream, it synchronizes on the device side. I almost never use that paradigm, so I don’t remember what the CPU thread behavior is. But apart from those calls, there is nothing in your main routine that would block the CPU thread. Therefore, if those calls were truly asynchronous with respect to the CPU thread, I would expect your app to exit well before the sequence of copy operations had completed.
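For reference, a minimal sketch of the pattern I'd normally use instead: issue the async copies into a created (non-default) stream and wait on that stream explicitly before exiting. Buffer names here (`dDst`, `hSrc`, `nbytes`) are placeholders, not from your code:

```cuda
// Sketch: truly asynchronous copies need a non-default stream plus an
// explicit wait; dDst is device memory (cudaMalloc), hSrc is pinned
// host memory (cudaMallocHost).
cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(dDst, hSrc, nbytes, cudaMemcpyHostToDevice, stream);
// ... kernel launches / device-to-host copies issued into the same stream ...

cudaStreamSynchronize(stream);   // block here until all work in 'stream' is done
cudaStreamDestroy(stream);
```

Without that final synchronize (or an equivalent blocking call), nothing stops the host thread from reaching the end of main while copies are still in flight.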

I guess your expectation is that, by setting cudaDeviceScheduleYield, that thread will show approximately zero CPU usage. Let’s take a look at the definition of that flag:

“Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device.”

You only have one thread in your application. So yielding to another thread doesn’t make much sense to me. I would expect that a yield operation might simply acknowledge that there are no other threads to yield to.

There aren’t any statements in that definition about what to expect for CPU thread usage, only that it “can increase the performance of CPU threads performing work in parallel with the device.” Your app has no way to demonstrate or measure such a thing. I don’t see any statement that says “this will drive your CPU thread usage to zero percent as measured by some unspecified tool.”
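If the actual goal is to minimize CPU usage while the thread waits on the device, the flag usually suggested is cudaDeviceScheduleBlockingSync rather than yield. A sketch of how that would look in the constructor (nGPUIdx as in the code above):

```cuda
// Sketch: block (rather than spin or yield) while waiting on the device.
// The flags must be set once, before the CUDA context for the device exists.
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
cudaSetDevice(nGPUIdx);
// With this policy, cudaDeviceSynchronize()/cudaStreamSynchronize() put the
// CPU thread to sleep on an OS primitive instead of busy-waiting.
```

Even so, this governs only how the thread waits; it makes no promise about the load a given tool like Task Manager will report.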

void D2D_CUDA8::CopyMemoryToDevice(int* Source, SIZE_T W, SIZE_T H)
{
    for (SIZE_T i = 0; i < H; i++)
    {
        err = cudaMemcpyAsync(RDMA_Memory + i * W, Source + i * W, W * sizeof(int), cudaMemcpyHostToDevice);
    }
}
void D2D_CUDA8::CopyMemoryToHost(int* Source, SIZE_T size)
{
	err = cudaMemcpyAsync(Source, RDMA_Memory, size, cudaMemcpyDeviceToHost);
}

I made two mistakes in my code.

  1. The device pointer RDMA_Memory used in this code should be allocated with cudaMalloc, not cudaMallocHost.
  2. The host memory passed as Source to cudaMemcpyAsync must be pinned using cudaHostRegister.

My code was working in sync mode, not async.
After I modified RDMA_Memory to be allocated with cudaMalloc and pinned the Source pointer using cudaHostRegister, the main function, as you said, returned immediately without waiting for the copy operations.

To verify that the data was actually copied, cudaDeviceSynchronize() must be called,
but that call shows CPU load and a long operating time.

void D2D_CUDA8::CopyMemoryToDevice(int* Source, SIZE_T W, SIZE_T H)
{
    cudaHostRegister(Source, W * H * sizeof(int), cudaHostRegisterDefault); // << CPU load + delay
    for (SIZE_T i = 0; i < H; i++)
    {
        err = cudaMemcpyAsync(RDMA_Memory + i * W, Source + i * W, W * sizeof(int), cudaMemcpyHostToDevice);
    }
}
int main()
{
    cuda = new D2D_CUDA8(1);
    for (int i = 0; i < 100; i++)
    {
        unsafe
        {
            cuda.CopyMemoryToDevice((int*)m_mainMemory.GetPtr().ToPointer(), (ulong)(40000 * 40000), (ulong)m_mainMemory.H);
            cuda.CopyMemoryToHost((int*)m_subMemory.GetPtr().ToPointer(), (ulong)(40000 * 40000));
        }
    }
    cudaDeviceSynchronize(); // << CPU load + delay
}

Eventually, even if the CPU load from the copy itself can be eliminated,
some CPU load and a long delay seem inevitable because of pinning the memory with cudaHostRegister and synchronizing the CPU with cudaDeviceSynchronize().
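One mitigation worth noting: cudaHostRegister (and cudaHostUnregister) are expensive, so the usual approach is to pay the pinning cost once at startup and reuse the pinned buffer, rather than registering inside the copy routine on every call. A sketch, with placeholder names (`hostBuf`, `hostBuf2`, `devBuf`, `nbytes`, `stream`) standing in for the buffers in the code above:

```cuda
// Sketch: pin once, copy many times.
int *hostBuf;
cudaMallocHost((void**)&hostBuf, nbytes);   // allocated pinned from the start
// -- or, for memory allocated elsewhere, register it once at startup: --
// cudaHostRegister(existingPtr, nbytes, cudaHostRegisterDefault);

for (int i = 0; i < 100; i++) {
    cudaMemcpyAsync(devBuf,   hostBuf, nbytes, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(hostBuf2, devBuf,  nbytes, cudaMemcpyDeviceToHost, stream);
}
cudaStreamSynchronize(stream);   // one wait at the end, not one per iteration

cudaFreeHost(hostBuf);           // or cudaHostUnregister(existingPtr) at shutdown
```

This doesn't remove the synchronize at the end, but it moves the pinning cost out of the hot path, and combined with cudaDeviceScheduleBlockingSync the wait itself should sleep rather than spin.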