Slow Memory Copies

GPU: Quadro P4000 in TCC mode.

Scenario:

  • Data is being RDMA’d into the GPU.
    – Chunks of 1 GPU page - 64KB.
    – 10 chunks per “cycle” - 640KB.
    – 800 cycles - 500MB.

If I do this with cudaMemcpy - 800 memory copies - the copy is very fast, on the order of microseconds (see the sketch below).

  • Makes sense given GPU RAM speeds.
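
For reference, a minimal sketch of the cudaMemcpy baseline described above, assuming one device-to-device copy per 640KB cycle (the constants, function name, and pointer names are illustrative, not from the original post):

    // Sketch of the cudaMemcpy baseline: 800 device-to-device copies of one 640KB cycle each.
    // CYCLE_BYTES, NUM_CYCLES, d_dst and d_src are assumed names for illustration.
    #include <cuda_runtime.h>
    #include <cstdint>

    constexpr size_t CYCLE_BYTES = 10 * 64 * 1024;   // 10 chunks of 64KB = 640KB per cycle
    constexpr int    NUM_CYCLES  = 800;              // 800 cycles * 640KB = 500MB total

    void copy_with_cudamemcpy(uint8_t* d_dst, const uint8_t* d_src)
    {
        for (int i = 0; i < NUM_CYCLES; ++i)
        {
            const size_t offset = static_cast<size_t>(i) * CYCLE_BYTES;
            cudaMemcpy(d_dst + offset, d_src + offset, CYCLE_BYTES,
                       cudaMemcpyDeviceToDevice);
        }
    }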

If I move this into a kernel:

  • 1 block, 10 threads per block.
  • Each thread has a for-loop, looping 80 times.
  • Kernel arguments take in pointers to data.
  • Pointers are cast to a struct with an array of 1 GPU page in size.
    struct buffer
    {
        uint8_t data[GPU_PAGE_SIZE];
    };
  • Use struct assignment to perform the deep copy of the array (see the sketch after this list).
    {
    *pDst = *pSrc;
    }
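
A minimal sketch of the kernel as described, assuming one struct assignment per thread per loop iteration (the function name, parameters, and indexing are illustrative guesses):

    // Hypothetical reconstruction of the kernel described above, for discussion only.
    #include <cstdint>

    #define GPU_PAGE_SIZE (64 * 1024)

    struct buffer
    {
        uint8_t data[GPU_PAGE_SIZE];
    };

    // Launched as copy_pages_naive<<<1, 10>>>(pSrc, pDst, loopsPerThread):
    // each thread deep-copies one whole 64KB page per iteration via struct assignment,
    // so a single thread moves 64KB serially with no coalescing across threads.
    __global__ void copy_pages_naive(const buffer* pSrc, buffer* pDst, int loopsPerThread)
    {
        for (int i = 0; i < loopsPerThread; ++i)
        {
            const int page = i * blockDim.x + threadIdx.x;
            pDst[page] = pSrc[page];
        }
    }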

This same 500MB copy goes from milliseconds to tens of seconds - about 27 seconds.

  • Using Nsight, the delay is definitely in the kernel and not in the overhead of launching the kernel, etc.

Can someone explain why such a big difference?

  • I expected a significant delta, but definitely not this big.
  • Does it have anything to do with thread access between BAR1 memory vs FB memory?
  • Both regions of memory should be contiguous within the GPU.
  • Is it the for-loops?
  • The pointer increments?

Another test I ran, with an overall destination of 1GB (instead of 500MB), saw the multiple cudaMemcpy calls take ~78ms while the kernel version took ~32 seconds.

Thanks for any insight.

What does that mean exactly? You are using GPUDirect RDMA to transfer data from a 3rd party device to the GPU? Or do you simply mean you are copying data from host to device? Or are you referring to a copy on the same device?

You can transfer 500MB in microseconds? That doesn’t seem plausible unless you are referring to a device-to-device copy on the same device.

That is a really bad way to do data copying on the GPU, if you are doing one struct assignment per thread.

Thank you for the reply!

  1. Yes, GpuDirect from an FPGA over PCIe.

  2. Very possible given GPU memory speeds - I was able to copy 1GB in 11ms according to Nsight.
    – cudaMemcpy
    – 11,684.119 microseconds
    – 87,640.3 MB/s data rate

  3. Good information. What is the preferred method (suggested way)?

I see. But that really has nothing to do with your question, right? Your questions about copy speeds have nothing to do with copying data over PCIe, correct?

So I guess you are referring to a device-to-device copy on the same device here (not over PCIe, not from host to device). You have not answered my question, but that’s the only way the data could make sense. And I’m amused that a generic reference to “microseconds” includes a range up to 11,000 microseconds, but I quibble.

If data were flowing over PCIe, any valid/proper measurement of data transfer throughput could not exceed PCIe throughput, which is on the order of 12.5GB/s for a Gen 3 x16 link. 500MB of data could not be any faster than 0.5/12.5 s = 1/25 sec = 40 milliseconds (approx).

You want adjacent threads to read adjacent data, and write adjacent data. That will not happen with a struct-per-thread copy. Just read any resource on CUDA “coalescing” behavior.
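
For contrast, a minimal sketch of a coalesced copy (assumed names and types, not code from this thread): consecutive threads read and write consecutive 4-byte words, so each warp's accesses combine into a few wide memory transactions.

    // Grid-stride copy: thread k touches word k, k + stride, k + 2*stride, ...
    // Adjacent threads in a warp access adjacent words, which coalesces well.
    #include <cstdint>

    __global__ void copy_coalesced(const uint32_t* __restrict__ src,
                                   uint32_t*       __restrict__ dst,
                                   size_t numWords)
    {
        const size_t stride = (size_t)gridDim.x * blockDim.x;
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < numWords; i += stride)
        {
            dst[i] = src[i];
        }
    }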

Correct: For this question, I think GpuDirect is working just fine.

Sorry for my vagueness, I failed to make it clear.

Once the data is on the GPU, I am doing a device-to-device copy (BAR1 region to FB memory region). This is so that the GpuDirect RDMA buffer can be overwritten while the second buffer is used for “data processing” - in theory by another kernel (that portion is all still TBD).

I will look into this topic. Since the allocated memory is contiguous and the threads are also “contiguous” (1 block, 10 threads along x within the block), I thought I would be OK.

You don’t want to be using 10 threads per block either. You should choose threadblock sizes as whole-number multiples of 32. If you google “cuda copy kernel” I think you’ll come across a lot of useful reading and examples.
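
As an example of the launch arithmetic with a warp-multiple block size (the values are illustrative, and the copy_coalesced kernel from the sketch above is assumed):

    // Choose a block size that is a whole multiple of the 32-thread warp size,
    // then derive the grid size from the number of 4-byte words to copy.
    #include <cstdint>

    void launch_copy(uint32_t* d_dst, const uint32_t* d_src, size_t totalBytes)
    {
        const int    threadsPerBlock = 256;                                   // 8 warps per block
        const size_t numWords        = totalBytes / sizeof(uint32_t);
        const int    blocks          = (int)((numWords + threadsPerBlock - 1) / threadsPerBlock);

        copy_coalesced<<<blocks, threadsPerBlock>>>(d_src, d_dst, numWords);
    }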

More background:

  • I give the FPGA a page-table with addresses to each GPU page I have allocated.
  • The FPGA has limited resources and, for the sake of this thread, let’s say the FPGA can only store a page-table large enough to reference 10 pages on the GPU.
  • This is why I then decided to use 10 threads in the kernel - and since all they are doing is copying data, dev-to-dev, at GPU page size, I didn’t think there would be any inefficiency.
    – I obviously have a lot to learn, still.

So I looked up coalescing and I’ve changed my kernel around:

  • Each thread will copy a DWORD (4 bytes)
  • Each thread will loop x number of times based on the number of GPU pages that need to be copied to fill the destination buffer.

Scenario:
10 source GPU pages to fill 1GB of destination data.

Source Buffer:

  • 655360 B (10 * 64KB pages)
  • 655360 / 4 bytes per thread = 163840 threads needed
  • 163840 threads / 1024 threads per block = 160 blocks.

Destination Buffer:

  • 1073741824 B (1GB)
  • 1GB / 4 bytes per thread = 268435456 threads
  • 268435456 threads / 1024 threads per block = 262144 blocks

1GB / 640KB = 1638 source buffers to fill 1 destination buffer.
1GB % 640KB = 262144 left over bytes
– 262144 / 4 bytes per thread = 65536 left over threads
– 65536 / 1024 threads per block = 64 “left over” blocks

So my kernel, with all 163840 threads, will loop 1638 times to copy 1638 source buffers to fill 1 destination buffer.

Then threads 0 - 65535 will do a final copy to take care of the “left over” data.
– Threads 65536 - 163839 will simply exit.
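
Pieced together from the numbers above, the revised kernel might look roughly like this (a sketch under the stated assumptions: names and indexing are guesses, and any synchronization with the RDMA producer that refills the 640KB source buffer between iterations is omitted):

    // Launched as copy_dwords<<<160, 1024>>>(src, dst, 1638, 65536):
    // 163840 threads each copy one 4-byte word of the 640KB source buffer per iteration,
    // 1638 full iterations fill most of the 1GB destination, then threads 0-65535 handle
    // the remaining 262144 bytes while the rest exit.
    #include <cstdint>

    __global__ void copy_dwords(const uint32_t* __restrict__ src,   // 655360 B = 163840 words
                                uint32_t*       __restrict__ dst,   // 1GB destination
                                int    fullIterations,              // 1638
                                size_t leftoverWords)               // 65536
    {
        const size_t tid      = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        const size_t srcWords = (size_t)gridDim.x * blockDim.x;     // 160 * 1024 = 163840

        for (int i = 0; i < fullIterations; ++i)
            dst[(size_t)i * srcWords + tid] = src[tid];

        // "Left over" copy: only the first 65536 threads participate.
        if (tid < leftoverWords)
            dst[(size_t)fullIterations * srcWords + tid] = src[tid];
    }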

All this takes about 6 seconds.

Not really good numbers, and maintenance of the code, for new eyes, may not be straightforward.

Is it possible for a Quadro P4000 to generate an interrupt from a CUDA kernel to notify host software?

  • I tried using Windows condition variables, but that wasn’t working out.
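
One commonly used way to get a host-side notification without an interrupt out of the kernel itself is to enqueue a host callback (or record an event) after the kernel in the same stream; a minimal sketch, with all names assumed:

    // cudaLaunchHostFunc runs a host function once all preceding work in the stream
    // (here, the copy kernel) has completed; the callback can then signal whatever
    // host-side synchronization primitive the application uses.
    #include <cuda_runtime.h>
    #include <cstdio>

    void CUDART_CB onCopyDone(void* userData)
    {
        // Runs on a host thread after the kernel finishes; do not call CUDA APIs here.
        (void)userData;
        std::printf("copy complete\n");
    }

    void launch_and_notify(cudaStream_t stream)
    {
        // copy_kernel<<<blocks, threads, 0, stream>>>(...);   // the copy kernel goes here
        cudaLaunchHostFunc(stream, onCopyDone, nullptr);       // fires after the kernel completes
    }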