Slow Memory Copies

GPU: Quadro P4000 in TCC mode.


  • Data is being RDMA’d into the GPU.
    – Chunks of 1 GPU page - 64KB.
    – 10 chunks per “cycle” - 640KB.
    – 800 cycles - 500MB.

If I do a cudaMemcpy instead - 800 individual memory copies - the copy is very fast, in the microseconds.

  • Makes sense given GPU RAM speeds.
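The fast path above can be sketched roughly as follows. This is a hypothetical reconstruction from the description - the pointer names, the `CHUNKS_PER_CYCLE`/`NUM_CYCLES` constants, and the wrapper function are assumptions, not the original code:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

#define GPU_PAGE_SIZE    (64 * 1024)   // 64KB GPU page
#define CHUNKS_PER_CYCLE 10            // 640KB per cycle
#define NUM_CYCLES       800           // 800 * 640KB = 500MB

void copyWithMemcpy(uint8_t *dst, const uint8_t *src)
{
    size_t cycleBytes = (size_t)CHUNKS_PER_CYCLE * GPU_PAGE_SIZE;
    for (int i = 0; i < NUM_CYCLES; ++i) {
        // Device-to-device copy: the driver hands this to the GPU's
        // copy engines, which run near full memory bandwidth.
        cudaMemcpy(dst + i * cycleBytes, src + i * cycleBytes,
                   cycleBytes, cudaMemcpyDeviceToDevice);
    }
}
```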

If I move this into a kernel:

  • 1 block, 10 threads per block.
  • Each thread has a for-loop, looping 80 times.
  • Kernel arguments take in pointers to the data.
  • Pointers are cast to a struct containing an array of 1 GPU page:
      struct buffer {
          uint8_t data[GPU_PAGE_SIZE];
      };
  • A struct assignment performs the deep copy of the array:
      *pDst = *pSrc;
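A minimal sketch of the kernel as described (names and the `loops` parameter are assumptions). Each thread deep-copies entire 64KB pages via struct assignment, so a single thread serializes through 64KB of strided byte traffic per iteration:

```cuda
#include <cstdint>

#define GPU_PAGE_SIZE (64 * 1024)

struct buffer {
    uint8_t data[GPU_PAGE_SIZE];
};

__global__ void naiveCopy(buffer *pDst, const buffer *pSrc, int loops)
{
    int t = threadIdx.x;                 // 1 block, 10 threads
    for (int i = 0; i < loops; ++i) {
        int page = i * blockDim.x + t;   // one 64KB page per iteration
        pDst[page] = pSrc[page];         // struct assignment = deep copy
    }
}
```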

This same 500MB copy goes from milliseconds to tens of seconds - about 27 seconds.

  • Using Nsight, the delay is definitely in the kernel, not the overhead of launching the kernel, etc.

Can someone explain why such a big difference?

  • I expected a significant delta, but definitely not this big.
  • Does it have anything to do with thread access between BAR1 memory vs FB memory?
  • Both regions of memory should be contiguous within the GPU.
  • Is it the for-loops?
  • The pointer increments?

Another test I ran, with an overall destination of 1GB (instead of 500MB), saw the multiple cudaMemcpy calls take ~78ms while the kernel version took ~32 seconds.

Thanks for any insight.

What does that mean exactly? You are using GPUDirect RDMA to transfer data from a 3rd party device to the GPU? Or do you simply mean you are copying data from host to device? Or are you referring to a copy on the same device?

You can transfer 500MB in microseconds? That doesn’t seem plausible unless you are referring to a device-to-device copy on the same device.

That is a really bad way to do data copying on the GPU, if you are doing one struct assignment per thread.

Thank you for the reply!

  1. Yes GpuDirect from FPGA over PCIe.

  2. Very possible given GPU memory speeds - I was able to copy 1GB in 11ms according to Nsight.
    – cudaMemcpy
    – 11,684.119 microseconds
    – 87,640.3 MB/s data rate

  3. Good information. What is the preferred method (suggested way)?

I see. But that really has nothing to do with your question, right? Your questions about copy speeds have nothing to do with copying data over PCIe, correct?

So I guess you are referring to a device-to-device copy on the same device here (not over PCIe, not from host to device). You have not answered my question, but that’s the only way the data could make sense. And I’m amused that a generic reference to “microseconds” includes a range up to 11,000 microseconds, but I quibble.

If data were flowing over PCIe, any valid/proper measurement of data transfer throughput could not exceed PCIe throughput, which is on the order of 12.5GB/s for a Gen 3 x16 link. 500MB of data could not be any faster than 0.5/12.5 s = 1/25 sec = 40 milliseconds (approx).

You want adjacent threads to read adjacent data, and write adjacent data. That will not happen with a struct-per-thread copy. Just read any resource on CUDA “coalescing” behavior.
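A hedged sketch of what that advice looks like in practice (kernel name and launch geometry are assumptions): thread i handles element i, so adjacent threads in a warp touch adjacent addresses and each warp's loads and stores coalesce into full-width memory transactions:

```cuda
#include <cstdint>
#include <cstddef>

__global__ void coalescedCopy(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];   // a warp reads/writes one contiguous segment
}

// Launch with a block size that is a multiple of 32, e.g.:
//   coalescedCopy<<<(n + 255) / 256, 256>>>(dst, src, n);
```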

Correct: For this question, I think GpuDirect is working just fine.

Sorry for my vagueness, I failed to make it clear.

Once the data is on the GPU, I am doing a device-to-device copy (BAR1 region to FB memory region). This is so that the GpuDirect RDMA buffer can be overwritten while the second buffer is used for "data processing" - in theory by another kernel (that portion is all still TBD).

I will look into this topic. Since the memory allocated is contiguous and the threads are also "contiguous" (1 block, 10 threads along x within the block), I thought I would be OK.

You don’t want to be using 10 threads per block either. You should choose threadblock sizes as whole-number multiples of 32. If you google “cuda copy kernel” I think you’ll come across a lot of useful reading and examples.

More background:

  • I give the FPGA a page-table with addresses to each GPU page I have allocated.
  • The FPGA has limited resources and, for the sake of this thread, let’s say the FPGA can only store a page-table large enough to reference 10 pages on the GPU.
  • This is why I then decided to use 10 threads in the Kernel - and since all they are doing is copying data, dev-to-dev, at GPU Page sizes, I didn’t think there would be any inefficiency.
    – I obviously have a lot to learn, still.

So I looked up coalescing and I've changed my kernel around:

  • Each thread will copy a DWORD (4 bytes)
  • Each thread will loop x number of times based on the number of GPU pages that need to be copied to fill the destination buffer.

10 source GPU pages to fill 1GB of destination data.

Source Buffer:

  • 655360 B (10 * 64KB pages)
  • 655360 / 4 bytes per thread = 163840 threads needed
  • 163840 threads / 1024 threads per block = 160 blocks.

Destination Buffer:

  • 1073741824 B (1GB)
  • 1GB / 4 (bytes) = 268435456 threads
  • 268435456 threads / 1024 (threads per block) = 262144 blocks

1GB / 640KB = 1638 full source buffers to fill 1 destination buffer.
1GB % 640KB = 262144 left-over bytes
– 262144 / 4 bytes per thread = 65536 left over threads
– 65536 / 1024 threads per block = 64 “left over” blocks

So my kernel, with all 163840 threads, will loop 1638 times to copy 1638 source buffers to fill 1 destination buffer.

Then threads 0 - 65535 will do a final copy to take care of the “left over” data.
– Threads 65536 - 163839 will simply exit.
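The revised scheme might look roughly like this (all names are assumptions, and for simplicity the sketch re-reads the same source window each pass, whereas in the real pipeline the FPGA refills it between passes). Note the arithmetic checks out: 1638 × 655360 + 262144 = 1073741824 bytes = 1GB:

```cuda
#include <cstdint>

#define SRC_BYTES      (10u * 64u * 1024u)  // 655360 B source window
#define SRC_WORDS      (SRC_BYTES / 4u)     // 163840 uint32_t words
#define FULL_PASSES    1638u                // full 640KB fills in 1GB
#define LEFTOVER_WORDS 65536u               // (1GB - 1638 * 640KB) / 4

__global__ void refillCopy(uint32_t *dst, const uint32_t *src)
{
    // 160 blocks x 1024 threads = 163840 threads, tid in 0..163839.
    uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (uint32_t pass = 0; pass < FULL_PASSES; ++pass) {
        // Adjacent threads write adjacent words: fully coalesced.
        dst[pass * SRC_WORDS + tid] = src[tid];
    }
    if (tid < LEFTOVER_WORDS) {             // final partial pass
        dst[FULL_PASSES * SRC_WORDS + tid] = src[tid];
    }
    // Threads 65536..163839 simply skip the leftover copy and exit.
}
```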

All this takes about 6 seconds.

Not really good numbers, and the maintenance of the code, for new eyes, may not be straightforward.

Is it possible for a Quadro P4000 to generate an interrupt from a CUDA kernel to notify host software?

  • I tried using Windows condition variables but that wasn't working out.