I know this topic has been discussed before, but I'd like some more explanation…
The time taken to memcpy one byte to the device, execute a blank kernel, and read back one byte is surprisingly large. The portion of code I timed is the following:
cuMemcpyHtoDAsync(d_mem, h_mem, 1, stream);
cuLaunchGridAsync(BlankKernel, 1, 1, stream);
cuMemcpyDtoHAsync((void*)h_mem, d_mem, 1, stream);
cuCtxSynchronize();
Using the timer from cutil.h, I measure an execution time on the CPU of about 30 microseconds. I'm running a Tesla C870 on RHEL 5.
For my application, I need to run a matrix multiplication in under 50 us, so 30 us of overhead is really bad… If I time just the two memcpys, they take about 25 us, which seems very slow for writing and reading a single byte over PCIe. There has to be a way to do this faster.
Is there a way to go to an even lower level than the driver API? Can NVIDIA provide information on how to access GPU memory directly? Perhaps I could write my own PCIe-to-GPU driver… Or is there a way for another device (like an FPGA) to write and read memory on the GPU?
Any help would be greatly appreciated.