clEnqueueWriteBuffer under the hood

I was wondering if anyone could provide more information, beyond what the programming guides cover, on how clEnqueueWriteBuffer works under the hood.

Specifically, if I queue up, say, 100 transfers of small size (100 KB), do they all get grouped into a single transfer to the device? I am seeing a floor of about 2 ms on the transfer time for 100 enqueues of very small buffers: however small the buffers are, the total does not drop below that. I have read up extensively on the PCIe interface to see whether the issue is related to payload size (including the 8b/10b encoding), but there seems to be no relation.

I have transferred the same amount of data in a single clEnqueueWriteBuffer call at much higher speed (and lower latency). I am wondering, therefore, what happens in OpenCL when you queue up so many transfers?

I am of course using a clFinish() before stopping my clock…



If you are able to use OpenCL 1.1, and depending on why you want to do small transfers, clEnqueueWriteBufferRect might be a solution. I haven’t tried it, but I would hope that it aggregates the pieces into one speedy transfer.