Hi Noah,
Let me try to answer your question as thoroughly as possible. Where is the buffer allocated when you call clCreateBuffer() with a context? The short answer is: nowhere. The concept of buffers in OpenCL is a little more abstract than one might think at first, and this seeming lack of control over where the actual memory resides is one of the best things OpenCL has; it comes in handy when creating complex applications.
The most powerful capability of an OpenCL context is that it provides memory coherency across all devices of the context (to some extent; the programmer can break it if he/she wishes). This means that the runtime keeps track of which device modified a buffer last, and thus which holds the most up-to-date version of it. If you simply create a buffer and start using it on a device without calling clEnqueueWriteBuffer(), you will see that everything works fine. This is because the context knows that the device wishing to use the buffer holds no copy of it, and that the most up-to-date version is in host memory (the buffer has just been created). Therefore, the first time you call clEnqueueNDRangeKernel(), the contents of the buffer will be moved to that device. This behaviour is called lazy memory management: the runtime will not move data until it is absolutely necessary. It can be overridden by explicit memory operations (clEnqueueRead/Write/Map…) to optimize application performance if required.

Let's say you have a second device inside the context. If you synchronized inside your command queue and ensured that the first device finished using (and modifying) the buffer, then when you call clEnqueueNDRangeKernel() for the second device, the runtime will know that the most up-to-date copy resides in device #1's VRAM, and it will fetch it from there. If the NVIDIA OpenCL runtime is well designed, this should already be done with a device-to-device cudaMemcpy() without host intervention. If you can find out whether this is the case, that would be nice.
And why is this good? It gives the programmer some extra flexibility when using more memory than is available on a GPU. If I have 1 GB of VRAM and would like to use more across different kernels, then I can allocate far more buffer space than I have VRAM, and the runtime will take care of moving buffers onto and off the device. One kernel might use an entire 1 GB, and a second kernel another 1 GB, and they will not collide, because buffers do not reside on a specific device but inside a context. I'm not sure whether the runtime will actually swap a buffer back to host memory on its own, or whether you need to call clEnqueueReadBuffer() explicitly, but you might experiment with it. If you try to use more buffer space at any one time than the device has VRAM, clEnqueueNDRangeKernel() will fail with CL_MEM_OBJECT_ALLOCATION_FAILURE (or CL_OUT_OF_RESOURCES), meaning that not all buffers specified for the kernel can be resident at once on the given device.
Breaking memory coherency happens if you modify the same buffer on two devices at the same time. Then the runtime cannot tell who has the newest version (and will most likely fetch from the device that finished later). Should you wish to do something like this (and assuming the devices do not modify the same parts of the buffer, which would never make sense), you should make use of sub-buffers. That way you can create one buffer for the entire system and create sub-buffers for the devices. You send out the sub-buffers to the devices, and when you call clEnqueueReadBuffer() on the host once all devices have finished, the runtime will see that partitions of the buffer (the sub-buffers) have been modified and will assemble the data by fetching the new parts from each device; you need not worry about the rest.
As for performance, I do not know which theoretical peak you are referring to. If you wish to hit the PCI-E 2.0 x16 limit, you will not be able to: most present-day GPUs cannot handle that high a bandwidth when writing to VRAM from PCI-E or reading from VRAM to PCI-E. You should look at the CUDA and OpenCL examples of memory movement and see what peak those reach (at most they will be around 12 GB/s, which is considered extremely good for a GPU). I'm sure the NV SDK has OpenCL-based memory movement examples. You might want to look at those, as they use some nice tricks (mostly ones you would not take the time to implement yourself).
Hope this helped and cleared up most things about buffers.
Cheers,
Máté