enqueueWriteBuffer for multiple devices


In a multi-GPU environment, I am having problems with the enqueueWriteBuffer method. The situation is as follows:

In my method that prepares the data, I iterate through all participating devices in the context (Context is a class in which the OpenCL context and a CommandQueue for each device are created; device is the device ID returned by the Context class). First, I create a Buffer for each device, then I create the data I want to pass to the devices through those Buffers. The data lives in an array of size 2 * overlap_range * participatingDevices * sizeof(T) bytes. This array is supposed to be split, since only part of the data is needed on each device (the first 2 * overlap_range elements are needed on the first device, the next 2 * overlap_range elements on the second device, and so on). Finally, the created data is uploaded to each device using enqueueWriteBuffer().

    size_t offset = 0;
    size_t size = 0;
    int i, j, k;

    participatingDevices = this->context.getDeviceCount();

    // Vector of Buffer pointers, one per device
    cl::vector<cl::Buffer*> overlap_regions(participatingDevices, NULL);

    for (device = 0; device < participatingDevices; device++) {
        overlap_regions[device] = new cl::Buffer(this->context.getOpenCLContext(), SCL_READ_ONLY, sizeof(T) * overlap_range * 2, NULL, &err);

        // Here the data that has to be passed to each device is created

        size = 2 * overlap_range * sizeof(T);
        offset = device * 2 * overlap_range * sizeof(T);

        err = this->context.getCommandQueue(device).enqueueWriteBuffer(
                *overlap_regions[device], CL_FALSE, 0, size,
                (void*) (pOverlap_region + offset), NULL, NULL);
    }

    // Execute the kernel for each device
    for (device = 0; device < participatingDevices; device++) {
        executeKernel(device, overlap_regions);
    }

In the executeKernel method, overlap_regions[device] is set as a kernel argument using:

err |= kernel.setArg(3, *(overlap_regions[device]));

Then the kernel is executed.

When I run the program after compilation, it works correctly for one device. But when I use two or more devices, it seems that the enqueueWriteBuffer calls do not work for the second and subsequent devices. The calculation on the first device is still correct.

I also tried making enqueueWriteBuffer blocking with the CL_TRUE flag, and waiting for the CommandQueue to finish after the call. Neither worked.

I cannot figure out what causes the problem. I can give additional information if needed. The behaviour has only been tested on an NVIDIA Tesla platform, since it is the only one I have access to with multiple devices (4). It will most likely occur on other platforms too. I would appreciate any hints or help.