Data is being RDMA’d into the GPU.
– Chunks of 1 GPU page - 64KB.
– 10 chunks per “cycle” - 640KB.
– 800 cycles - 500MB.
If I do this with cudaMemcpy (800 memory copies in total), the copy is very fast, in the microseconds.
Makes sense given GPU RAM speeds.
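Roughly, the cudaMemcpy version looks like this (a sketch, not my exact code; the pointer names are made up, and I'm assuming the 640KB staging region the FPGA writes into is reused every cycle while only the destination advances):

#include <cstdint>
#include <cuda_runtime.h>

#define GPU_PAGE_SIZE    (64 * 1024)                        // 64KB
#define CHUNKS_PER_CYCLE 10                                 // 640KB per cycle
#define NUM_CYCLES       800                                // ~500MB total
#define CYCLE_BYTES      (CHUNKS_PER_CYCLE * GPU_PAGE_SIZE)

// One cudaMemcpy per cycle: 800 copies of 640KB each.
// rdmaBuf is the BAR1-mapped region the FPGA RDMAs into; fbDst is the
// ~500MB destination in ordinary device (FB) memory.
void copy_with_memcpy(uint8_t *fbDst, const uint8_t *rdmaBuf)
{
    for (int c = 0; c < NUM_CYCLES; ++c) {
        cudaMemcpy(fbDst + (size_t)c * CYCLE_BYTES, rdmaBuf,
                   CYCLE_BYTES, cudaMemcpyDeviceToDevice);  // both regions are on the GPU
    }
}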
If I move this into a kernel:
– 1 block, 10 threads per block.
– Each thread has a for-loop, looping 80 times.
– Kernel arguments take in pointers to the data.
– The pointers are cast to a struct with an array of one GPU page in size:
struct buffer
{
    uint8_t data[GPU_PAGE_SIZE];
};

I use struct assignment to perform the deep copy of the array:

{
    *pDst = *pSrc;   // pSrc/pDst are buffer*, so each assignment copies GPU_PAGE_SIZE bytes
}
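Putting it together, the kernel is roughly this (a sketch of what I'm doing, not the exact code; the ITERS_PER_THREAD name and the page-to-thread mapping are just for illustration):

#define ITERS_PER_THREAD 80   // 10 threads x 80 iterations each

__global__ void struct_copy_kernel(buffer *pDst, const buffer *pSrc)
{
    int t = threadIdx.x;                        // one block, 10 threads
    for (int i = 0; i < ITERS_PER_THREAD; ++i) {
        int page = t * ITERS_PER_THREAD + i;    // the 64KB page this thread handles
        pDst[page] = pSrc[page];                // one thread deep-copies 64KB per assignment
    }
}

// launched as: struct_copy_kernel<<<1, 10>>>(pDst, pSrc);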
This same 500MB copy goes from milliseconds to tens of seconds, about 27 seconds.
Using Nsight, I can see the delay is definitely in the kernel itself and not in the overhead of launching the kernel, etc.
Can someone explain why such a big difference?
I expected a significant delta, but definitely not this big.
Does it have anything to do with how threads access BAR1 memory vs. FB memory?
Both regions of memory should be contiguous within the GPU.
Is it the for-loops?
The pointer increments?
Another test I ran, with an overall destination size of 1GB (instead of 500MB), saw the multiple cudaMemcpy calls take ~78ms while the kernel version took ~32 seconds.
What does that mean, exactly? Are you using GPUDirect RDMA to transfer data from a 3rd-party device to the GPU? Or do you simply mean you are copying data from host to device? Or are you referring to a copy on the same device?
You can transfer 500MB in microseconds? That doesn’t seem plausible unless you are referring to a device-to-device copy on the same device.
That is a really bad way to do data copying on the GPU, if you are doing one struct assignment per thread.
Very possible given GPU memory speeds: I was able to copy 1GB in 11ms according to Nsight.
– cudaMemcpy
– 11,684.119 microseconds
– 87,640.3 MB/s data rate
Good information. What is the preferred method?
I see. But that really has nothing to do with your question, right? Your questions about copy speeds have nothing to do with copying data over PCIe, correct?
So I guess you are referring to a device-to-device copy on the same device here (not over PCIe, not from host to device). You have not answered my question, but that’s the only way the data could make sense. And I’m amused that a generic reference to “microseconds” includes a range up to 11,000 microseconds, but I quibble.
If data were flowing over PCIe, any valid/proper measurement of data transfer throughput could not exceed PCIe throughput, which is on the order of 12.5GB/s for a Gen 3 x16 link. 500MB of data could not be any faster than 0.5/12.5 s = 1/25 sec = 40 milliseconds (approx).
You want adjacent threads to read adjacent data, and write adjacent data. That will not happen with a struct-per-thread copy. Just read any resource on CUDA “coalescing” behavior.
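For example, a minimal coalesced copy kernel looks like this (a sketch; the element type and names are illustrative):

#include <cstdint>
#include <cuda_runtime.h>

// Thread i copies word i, so adjacent threads in a warp read and write
// adjacent addresses, and the hardware combines the warp's 32 accesses into
// a few wide transactions. With the 64KB struct-per-thread copy, adjacent
// threads are 64KB apart on every access, so nothing coalesces.
__global__ void copy_words(uint32_t *dst, const uint32_t *src, size_t nWords)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nWords)
        dst[i] = src[i];
}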
Correct: for this question, I think GPUDirect is working just fine.
Sorry for my vagueness; I failed to make it clear.
Once the data is on the GPU, I am doing a dev-to-dev copy (BAR1 region to FB memory region). This is so that the GPUDirect RDMA buffer can be overwritten while the second buffer is used for “data-processing”, in theory by another kernel (that portion is all still TBD).
I will look into this topic. Since the memory allocated is contiguous and the threads are also “contiguous” (1 block, 10 threads along x within the block), I thought I would be OK.
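Conceptually, the BAR1-to-FB staging I described is something like this (just a sketch of the intent; the buffer names, stream usage, and synchronization choice are illustrative, and the processing kernel doesn't exist yet):

#include <cstdint>
#include <cuda_runtime.h>

// rdmaBuf: the BAR1-mapped region the FPGA RDMAs into (640KB per cycle here).
// fbBuf:   a second buffer in ordinary device (FB) memory.
// Once the copy completes, the FPGA is free to overwrite rdmaBuf while a
// (future) processing kernel consumes fbBuf.
void stage_cycle(uint8_t *fbBuf, const uint8_t *rdmaBuf, size_t cycleBytes,
                 cudaStream_t stream)
{
    cudaMemcpyAsync(fbBuf, rdmaBuf, cycleBytes,
                    cudaMemcpyDeviceToDevice, stream);
    cudaStreamSynchronize(stream);   // or record/wait an event before letting the FPGA reuse rdmaBuf
    // process_kernel<<<..., 0, stream>>>(fbBuf, ...);   // TBD
}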
You don’t want to be using 10 threads per block either. You should choose threadblock sizes that are whole-number multiples of 32. If you google “cuda copy kernel”, I think you’ll come across a lot of useful reading and examples.
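As a sketch of the kind of copy kernel that search turns up (not a specific library's code; the sizes and names here are illustrative): a grid-stride loop, launched with 256 threads per block, copying 16 bytes per thread per pass:

#include <cstdint>
#include <cuda_runtime.h>

// Grid-stride copy: each thread copies many elements, but on every pass the
// threads of a warp still touch adjacent 16-byte words, so accesses stay
// coalesced no matter how many blocks are launched.
__global__ void copy_kernel(uint4 *dst, const uint4 *src, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

void launch_copy(void *dst, const void *src, size_t bytes, cudaStream_t stream)
{
    size_t n = bytes / sizeof(uint4);            // assumes bytes is a multiple of 16
    int threads = 256;                           // a whole-number multiple of 32
    int blocks  = (int)((n + threads - 1) / threads);
    if (blocks > 1024) blocks = 1024;            // the grid-stride loop covers the rest
    copy_kernel<<<blocks, threads, 0, stream>>>((uint4 *)dst, (const uint4 *)src, n);
}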
I give the FPGA a page-table with the addresses of each GPU page I have allocated.
The FPGA has limited resources and, for the sake of this thread, let’s say the FPGA can only store a page-table large enough to reference 10 pages on the GPU.
This is why I decided to use 10 threads in the kernel, and since all they are doing is copying data, dev-to-dev, at GPU page size, I didn’t think there would be any inefficiency.
– I obviously have a lot to learn, still.
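To tie the block-size suggestion to the FPGA constraint (a sketch; the 10-blocks-of-256-threads mapping is just one reasonable choice, not something tested against this setup): the 10 resident pages can become 10 blocks rather than 10 threads, with each block's threads copying its 64KB page cooperatively:

#include <cstdint>
#include <cuda_runtime.h>

#define GPU_PAGE_SIZE (64 * 1024)

// One block per RDMA'd page, 256 threads per block. Within a page the threads
// stride through the 16-byte words together, so each warp's accesses stay
// adjacent and coalesce; indexing is kept flat since both regions are contiguous.
__global__ void copy_pages(uint4 *dst, const uint4 *src)
{
    const size_t wordsPerPage = GPU_PAGE_SIZE / sizeof(uint4);   // 4096 uint4 per 64KB page
    size_t base = (size_t)blockIdx.x * wordsPerPage;             // this block's page
    for (size_t w = threadIdx.x; w < wordsPerPage; w += blockDim.x)
        dst[base + w] = src[base + w];
}

// launched once per cycle as: copy_pages<<<10, 256>>>(fbDst, rdmaSrc);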