I was wondering whether anyone could provide more information, beyond what the programming guides cover, on how clEnqueueWriteBuffer works under the hood.
Specifically, if I queue up, say, 100 small transfers (100 KB each), do they all get grouped into a single transfer to the device? I am seeing a floor of about 2 ms on the total transfer time for 100 enqueues of very small buffers. I have read up extensively on the PCIe interface to see whether the issue is related to payload size (including the 8b/10b encoding overhead), but there seems to be no relation.
I have transferred the same amount of data in a single clEnqueueWriteBuffer call at much higher speed (and lower latency). I am wondering, therefore, what happens in OpenCL when so many transfers are queued up.
I am, of course, calling clFinish() before stopping my clock…