how are 'device' buffers actually allocated with multiple devices in a context clCreateBuffe

I’ll start with the question, then the background / motivation.

When calling


, an OpenCL context is provided, not an explicit device. So with the command

cmDevBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, memSize, NULL, NULL)

how can the implementation know where to allocate the buffer memory? The answer would be obvious if there is only one device in the context. But what if there are multiple devices? I think want to ensure allocation on the particular device for which I plan to latter call



In practice, I don’t see any performance difference whether my context has three devices (CPU,AMD GPU, NVIDIA GPU) or just the NVIDIA GPU. But, performance is less than theoretical peak, so I want to understand how this works.

More background:

I’ve been testing a few different scenarios to process elements of very large arrays using OpenCL on my NVIDIA Quadro 4000 (PCIe Gen2 x16)

Consulting the NVIDIA OpenCL Best Practices Guide v 1.0 and knowing that my arrays on the host are bigger than GPU global memory and that I want to overlap PCI transfers with computing, I have implemented the following pattern for each patch of the larger host array:

Host Array -> OpenCL Allocated ‘pinned’ Host Buffer -> OpenCL Allocated non-host-accessible buffer. And then the opposite for the output.

My kernel right now is trivial. out[index] = in[index] + 1.0. My performance should be entirely PCIe bandwidth limited.

My machine is a Mac Pro OS 10.6.

I see simultaneous 1.2GB/s to and from transfer data rate. I’d like to get closer to the theoretical peak but not sure what else to do?

The text of the best practices guide suggests that the


call that I pasted in above allocates “GPU device GMEM” but how can this be true with multiple devices in a context?

Hi Noah,

let me try to answer your question as thorough as possible. Where is the buffer allocated when you call clCreateBuffer with a context? The short answer is: nowhere. The concept of buffers in OpenCL is a little more abstract than one might think at first. This seemingly lack of control of where actual memory resides is one of the best things OpenCL has that comes in handy when creating complex applications.

The most powerful capability of OpenCL context is that it provides memory coherency across all devices of the context (to some extent, that the programmer can break if he/she wishes). This means that the runtime keeps track of which device modified a buffer last, and as such has the most up-to-date version of it. If you simply create a buffer and start using it on device without calling clEnqueueWriteBuffer() you will see that everything works fine. This is because the context knows that the device that wishes to use the buffer has no copy of it, and most up-to-date version is inside host memory (it has just been created). Therefore the first time you call clEnqueueNDRangeKernel() contents of the buffer will be moved to the specific device. This behaviour is called lazy memory management, meaning that the runtime will not move data, just when it is absolutely neccessary. This can be overriden by calling explicit memory operations (clEnqueueRead/Write/Map…) to optimize application performance if required. Let’s say that you have a second device inside the context, if you synchronized inside your command queue and ensured that first device finished using (and modifying) the buffer, when you call clEnqueueNDRangeKernel() for the second device, runtime will know that most up-to-date buffer resides inside device#1 VRAM. Therefore it will fetch it from there. If NVIDIA OpenCL runtime is well designed, this should already be done using device-to-device cudaMemCopy() without host intervention. If you find this out if this is the case, that would be nice.

And why is this good? This gives the programmer a little more flexibility when using more memory than available on a GPU. If I have 1GB VRAM and I would like to use more with different kernels, than I can allocate far more buffers than VRAM I have, and the runtime will take care of switching buffers onto and down from device. One kernel might use entire 1GB, and second kernel use another set of 1GB, and they will not collide, because buffers do not reside on a specific device, but inside a context. I’m not sure if the runtime will actually swap buffer back to host memory, or you need to explicitly call clEnqueueReadBuffer(), but you might experiment with it. Trying to use more buffers at any one time than VRAM the device has, clEnqueueNDRangeKernel() will throw error CL_DEVICE_NOT_ENOUGH_RESOURCES (or something like that), meaning that not all buffers specified for the kernel can be present at once on the given device.

Breaking memory coherency can be done if you modify the same buffer on two devices at the same time. Then runtime cannot tell who has the newest version (and most likely will fetch from device that finished later). Should you wish to do something like this (and assuming you do not modify same parts of the buffer, which should never make sense) you should make use of sub-buffers. That way you can create one buffer for entire system, and create sub-buffers for devices. Then you send out sub-buffers to devices, and when you call clEnqueueReadBuffer() on host once all devices have finished, runtime will see that partitions of buffer (that are the sub-buffers) have been modified, and will assemble data by fetching the new data from each device, and you need not worry about the rest.

As for performance, I do not know which theoretical peak performance you are talking about. If you wish to hit PCI-E2.0 x16 limit, you will not be able to do that, as most present day GPUs cannot handle that high bandwidth when writing to VRAM from PCIE or reading from VRAM to PCIE. You should look at CUDA and OpenCL examples of memory movement and see what peak those have. (At most they will be around 12GB/s, which is considered extremely good for a GPU. I’m sure NV SDK has OpenCL based memory movement examples. You might want to look at those as they utilize some nice tricks (mostly ones you will not take the time to implement).

Hope I helped and cleared most things about buffers.


Wow. Thanks for that explanation.

I’ve decided to run my test program with some finer profiling implemented to see some clearer evidence of what is really happening. I’ll post more after that.

I do have a few immediate follow-up questions.

Since clEnqueueWriteBuffer() takes a command queue argument, presumably the run-time knows explicitly what device to transfer the buffer to. I can be assured that when that event is complete, data was actually placed on the device?

I wonder if I would see any different performance if I got rid of the trailing buffer I intended to be on the ‘device’ – the one created without a host memory pointer? I’ll try this. I would then not need a clEnqueueWriteBuffer() call.
The tricky part is that I want two sets of buffers ‘on the device’. One set is being computed while simultaneously the other set is undergoing to/from memory transfer. It is not clear to me how to achieve this without enqueuing my own buffer read/write commands.

You mentioned 12GB/s for observed performance. I understand PCIe Gen 2 x16 to give max 8GB/s each way. Is 12GB/s meant to be a sum of the two?


Forgive me for the 12GB/s, it was a mistake from my part. Most GPUs perform around 4GB/s when it comes to PCIE transfers.

If you would like to paralellize computation yes, you would need explicit writeBuffer commands. Since one context may have multiple devices, but a commandQueue only takes a single device as argument upon creation, the commandqueue you issue the WriteBuffer in decides uniquely which device you will do the copy to. Here is what you would have to do:

1.) Enqueue writeBuffer to device for first half of data (in blocking manner)
2.) Enqueue kernels to process half the data AND associate firstEvent
3.) Enqueue writeBuffer to device for second half of data AND associate secondEvent
4.) Enqueue kernels to process second half of data WITH a 2 long event waitlist {firstEvent, secondEvent}
5.) clFinish()

If you wish to place this routine into a for-cycle, you can do a circular event waitlist, so that the for cycle executes safely and efficiently. (This way there is an unnecessary waiting at clFinish, where first writeBuffer could already start executing while second kernel call is in progress. Entering such cycles takes a little thought in advance, but are far from impossible. Although this routine is a little redundant, as first writeBuffer is really unnecessary, as kernels won’t start executing without it and will be done implicitly (as it should be clear now). I wrote it out so it’s easier to understand and easier to put cycles around it.

CommandQueue in association with events and markers are really powerful tools that allow a lot of black magic to be done.

Actually as far as I know there is no memory coherency, at least according to the OpenCL specification. The only place where there is some coherency is with mapped pointers.
I’ve heard conflicting reports about it and should probably verify at some point. Behavior may also change between implementations. I have seen some behaviors that can driver the profiler and system crazy if people are not careful.

What happens with OpenCL is lazy allocation. The allocation happens only when the driver knows where to allocate. Thus, create buffer only specifies the requested size, but not who is going to use it. When you call a write command or a kernel that make use of the buffer along with a device an actual allocation takes place. Note that this can cause allocation errors on write or kernel launch due to the device running out of memory, despite the create buffer succeeding, memory allocation size problems and you may need to use different contexts for different devices, or one context for different devices under different scenarios for best performance. It will also cause a delay on first call to write or kernel as that call will also include the overhead for the malloc.

As for memory transfer speed, PCI-E gen 2 X16 has a theoretical limit of 8GB/s. Practical, that drops to around 5.5GB/s on pinned memory and 2.5GB/s-4.5GB/s on unpinned memory (performance can vary widely on different systems). This is due to latency, communication overhead and a bunch of other parameters.

PCI-E gen3 should go up to 16GB/s theoretical, which means 10-12 GB/s practical (pinned memory again).


I do not remember where I read all this about consistency inside a context, but I’ll try to find it. The only thing mentioned about this inside the spec is under memory model at the very end:

“Memory consistency for memory objects shared between enqueued commands is enforced at a synchronization point.”

This could easily (and not completely) summarize what I’ve been saying, although it can also refer to consystency on a single device. Nontheless, I’ll try to find the source of this information.

Just some follow up now that I have added more precise trace instrumentation to my test code.

When it comes to actual transfer over the PCIe bus as measured by a blocking call to clEnqueueReadBuffer() and clEnqueueWriteBuffer, I see 6.7 GB/s each direction to an AMD Radeon HD 5770. This data is following an initial priming cycle.

My observed transfer rate is going up several factors because it turned out my kernel execution time was much longer than I expected / it should be. The main issue turned out to be not doing adjacent access to global memory from my work items. I have a 2D kernel. When I made a trivial change to make it a 1D kernel with the same memory accesses and computation, the execution time went down by a factor of 10. There must be an issue with how I compute a 2D work-item’s offset into the global array versus how OpenCL groups work items. I’m computing 2D array offset in row-major form. I would think that compute devices would execute work items in the same way, progressing among the highest dimension the fastest. Based on my observation, my assumption is probably wrong. Does the OpenCL spec actually address this?
More explicitly, say you had a 4x4 ND range of work items and a compute device with four cores in a workgroup 0-3. In what order would the work items be processed by the cores 0-3?

Aside about CPU device execution: I noticed that the first ND Range execution per host thread takes almost one hundred times longer than subsequent ones. This is for 1M work items in the ND Range. I think it is notable (i.e. I’m baffled) that this initial start-up overhead is per calling thread not just for the very first clEnqueueNDRangeKernel() call on the same command queue. I think CPU OpenCL work items are processed by the OS as light-weight threads. I have a hypothesis that creating these threads and the infrastructure to support them is duplicated for every host calling thread.

I don’t have data for my Quadro 4000 because I upgraded to Lion 10.7 over the weekend and learned that there is no longer OpenCL support for that card. :(

The overhead on the first ndrange call is probably due to two things

  1. Memory allocation for the output buffer is done on the first kernel call. You can allocating it as r/w and do an enqueue write to it before the call to force allocation and see if that changes the running time.
  2. Final compilation of the kernel from ptx to binary and loading it into the gpu also happens on the first call. The compilation should only happen when you change the kernel and after that it should be cached though.

Ah. #1 explains perfectly why I see the long call time for the first kernel execution per thread. Each of my three threads has its own read-only input buffer and write-only output buffer. Thanks.