GPU: Quadro P4000 in TCC mode.
Scenario:
- Data is being RDMA’d into the GPU.
- Chunks of 1 GPU page (64 KB).
- 10 chunks per “cycle” (640 KB).
- 800 cycles (500 MB total).
If I do this with cudaMemcpy (800 separate copies, sketched below), the copies complete very quickly - in the microseconds.
- Makes sense given GPU RAM speeds.
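For reference, the fast path is roughly the following (a simplified sketch, not my exact code; I'm assuming plain device-to-device copies here, and dst/src are placeholder uint8_t* device pointers):

// One cudaMemcpy per 640 KB cycle (10 chunks of one 64 KB GPU page).
const size_t CYCLE_BYTES = 10 * 64 * 1024;
for (int cycle = 0; cycle < 800; ++cycle)
{
    cudaMemcpy(dst + (size_t)cycle * CYCLE_BYTES,
               src + (size_t)cycle * CYCLE_BYTES,
               CYCLE_BYTES,
               cudaMemcpyDeviceToDevice);
}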
If I move this copy into a kernel:
- 1 block, 10 threads per block.
- Each thread has a for-loop, looping 80 times.
- The kernel arguments take in pointers to the data.
- The pointers are cast to a struct containing an array of one GPU page:
struct buffer
{
    uint8_t data[GPU_PAGE_SIZE];
};

- I use struct assignment to perform the deep copy of the array:

*pDst = *pSrc;
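To be concrete, the kernel is essentially the following (a simplified sketch rather than my exact code; the kernel name, pointer names, page-to-thread mapping, and the pagesPerThread argument are placeholders, with GPU_PAGE_SIZE being the 64 KB page):

#include <stdint.h>

#define GPU_PAGE_SIZE (64 * 1024)

struct buffer
{
    uint8_t data[GPU_PAGE_SIZE];
};

// Each thread walks its own range of pages and copies them one at a
// time with a struct assignment (a 64 KB deep copy per iteration).
__global__ void copy_pages(buffer *pDst, const buffer *pSrc, int pagesPerThread)
{
    int tid = threadIdx.x;                   // launched as <<<1, 10>>>
    for (int i = 0; i < pagesPerThread; ++i)
    {
        int page = tid * pagesPerThread + i;
        pDst[page] = pSrc[page];             // struct assignment = deep copy of one page
    }
}

// Launch: copy_pages<<<1, 10>>>(dst, src, 80);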
This same 500 MB copy goes from milliseconds to tens of seconds - about 27 seconds.
- Using Nsight, I can see the time is definitely spent inside the kernel itself, not in kernel-launch overhead, etc.
Can someone explain why there is such a big difference?
- I expected a significant delta, but definitely not this big.
- Does it have anything to do with thread access to BAR1 memory vs. FB memory?
- Both regions of memory should be contiguous within the GPU.
- Is it the for-loops?
- The pointer increments? (A rough sketch of what I mean by the per-thread copy is below.)
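By "for-loops / pointer increments" I mean that, as I understand it (this is my assumption about what the compiler generates; the actual loads/stores may be wider), each struct assignment boils down to one thread serially walking an entire 64 KB page by itself, roughly:

// What one struct assignment amounts to for a single thread:
for (size_t b = 0; b < GPU_PAGE_SIZE; ++b)
    pDst->data[b] = pSrc->data[b];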
Another test I ran, with an overall destination of 1 GB (instead of 500 MB), saw the multiple cudaMemcpy calls take ~78 ms while the kernel version took ~32 seconds.
Thanks for any insight.