Application does not scale when using cl::Buffer-Object

Hello,

I have the following problem:

My application does not scale across multiple GPUs. It is always a bit slower on more GPUs than on fewer.

I was able to track this down to a cl::Buffer object. I use the buffer as follows:

First I create an ordinary array of 20 ints with malloc() (the elements are filled in later):

int* pOverlap_region = (int*) malloc(20 * sizeof(int)); // 80 bytes with 4-byte ints

After it is filled, I create the buffer object:

cl::Buffer overlap_region(
this->context.getOpenCLContext(), CL_MEM_COPY_HOST_PTR, 20 * sizeof(int), pOverlap_region, &err);

this->context.getOpenCLContext() returns the context.

Then it is set as an argument for the kernel:

err |= kernel.setArg(3, (cl::Buffer) overlap_region);

If this buffer is created but not set as a kernel argument, the application scales across multiple GPUs.

Does anybody know why it behaves like this?

Thanks for your replies

OpenCL does not make many promises about buffers that are shared by multiple devices, although this is discussed briefly in Appendix A.1 of the OpenCL 1.0 specification.

In general it should be valid to use a buffer on multiple devices at the same time as long as it is not modified. Modifying a shared buffer requires explicit synchronization (see the appendix); otherwise the result is undefined.

A separate question is what the NVIDIA implementation of OpenCL does in such circumstances. If it somehow serializes kernel calls on different devices that share resources, that would explain what you are seeing.

The easiest way to find out would be to use the profiler (either the computeprof application or the low-level interface that writes simple text logs or CSV-formatted files). By looking at the time stamps for invocation and for GPU start and end time, you could easily see what happens.
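To make the time stamp comparison concrete, here is a minimal sketch of how the start and end times could be checked for serialization. It is plain C++ and does not call OpenCL itself. The struct KernelInterval and the function overlappedTime are hypothetical names made up for this illustration. I am assuming you have one start/end pair per kernel launch, taken either from the profiler log or from cl::Event::getProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END on a queue created with CL_QUEUE_PROFILING_ENABLE, and that the values from different GPUs are on a comparable time base, which is not guaranteed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One kernel execution as reported by the profiler (hypothetical type).
// Times are in nanoseconds and assumed to share a common time base.
struct KernelInterval {
    int device;           // device index, used only for reporting
    std::uint64_t start;  // time the kernel started executing
    std::uint64_t end;    // time the kernel finished executing
};

// Returns the total time during which at least two kernels were running
// at once. If kernels enqueued together on different GPUs give a result
// close to zero, they were probably executed one after another.
std::uint64_t overlappedTime(const std::vector<KernelInterval>& runs) {
    struct Edge {
        std::uint64_t t;  // time stamp of the event
        int delta;        // +1 for a kernel start, -1 for a kernel end
    };

    std::vector<Edge> edges;
    for (const KernelInterval& r : runs) {
        assert(r.end >= r.start);  // a kernel cannot end before it starts
        edges.push_back({r.start, +1});
        edges.push_back({r.end, -1});
    }

    // Sort by time; at equal times process ends before starts so that
    // back-to-back kernels are not counted as overlapping.
    std::sort(edges.begin(), edges.end(), [](const Edge& a, const Edge& b) {
        return a.t != b.t ? a.t < b.t : a.delta < b.delta;
    });

    std::uint64_t overlapped = 0;
    std::uint64_t prev = 0;
    int active = 0;  // number of kernels running just before the current edge
    for (const Edge& e : edges) {
        if (active >= 2) {
            overlapped += e.t - prev;
        }
        active += e.delta;
        prev = e.t;
    }
    return overlapped;
}
```

For example, two kernels running from 100 to 200 and from 200 to 300 give an overlap of 0, which would point to serialization, while 100 to 200 and 150 to 250 give an overlap of 50.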