[SOLVED] What causes my OpenCL kernel serialized when running on multiple GPUs?

My single-context-multiple-queue OpenCL code (http://github.com/fangq/mcxcl/) has been behaving like this from the beginning - when running on multiple GPUs, the kernel execution is serialized. This only happens on NVIDIA GPUs; on AMD GPUs, they are executed concurrently.

The only way to run concurrently for my kernel was to create multiple host-threads, one context for each thread, involving changes like this

I also saw previous reports like this

so I had thought that this must be a bug in NVIDIA’s ocl driver - until I tested the multiGPU sample code in NVIDIA’s OpenCL SDK, i.e. this code

which also uses a single-context-multiple-queue structure, but it executes its simple kernel concurrently on my linux box (Ubuntu 16.04, 418.56, dual Titan V).

In comparison, my code shares a very similar structure, which can be summarized in the below snippet:

mycontext=clCreateContext(..,devnum,devIDs,..);
for(int i=0;i<devnum; i++){
    myqueue[i]=clCreateEqueue(mycontext,...);
}
for(int i=0;i<devnum; i++){
    arg1[i]=clCreateBuffer(mcxcontext,...);
    arg2[i]=clCreateBuffer(mcxcontext,...);
    ...
}
clBuildProgram(mcxprogram,...);
for(int i=0;i<devnum; i++){
    mcxkernel[i] = clCreateKernel(mcxprogram, "mcx_main_loop", &status),status);
    clSetKernelArg(mcxkernel[i], 0, sizeof(cl_mem), (void*)(arg1+i)); 
    clSetKernelArg(mcxkernel[i], 1, sizeof(cl_mem), (void*)(arg2+i));
    ...
}
for(int i=0;i<devnum; i++){
   clEnqueueNDRangeKernel(mcxqueue[i],mcxkernel[i],...)
   clFlush(mcxqueue[i]);
}
for(int i=0;i<devnum; i++){
   clFinish(mcxqueue[i]);
}

can someone let me know if you notice anything that I used in my code, in comparison with the oclSimpleMultiGPU.cpp example, that prevents my kernel from running concurrently? (again, it does run conurrently on AMD GPUs)

to prove that the code is currently running sequentially, you can run

git clone https://github.com/fangq/mcxcl.git
cd src
make clean
make

../bin/mcxcl --bench cube60 -G 1 -n 1e7  # running 1e7 photons using 1st GPU
../bin/mcxcl --bench cube60 -G 11 -n 1e7  # running 1e7 photons using 1st+2nd GPUs

on an NVIDIA system with multiple GPUs, the execution time of the last command is the same as the 1st one. I expect it to be 1/2 if the execution is concurrent.

never mind. mystery solved!

it turns out that the shared RO_MEM buffers had caused the serialization of the kernels! it was not the fault of a single context, as I always thought to be.

after duplicating those RO_MEM buffers for each device and assign those duplicated buffer points to each kernel (i.e. clSetKernelArg(mcxkernel[i],... (void*)(buf+i)) ), I was able to get concurrent execution on NVIDIA GPUs, no need for multi-threading/multi-context