My single-context, multiple-queue OpenCL code (GitHub - fangq/mcxcl: Monte Carlo eXtreme for OpenCL (MCXCL)) has behaved like this from the beginning: when running on multiple GPUs, the kernel executions are serialized. This only happens on NVIDIA GPUs; on AMD GPUs, the kernels execute concurrently.
The only way I could get my kernels to run concurrently was to create multiple host threads, one context per thread, involving changes like this
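For reference, here is a minimal sketch of that workaround, with one host thread and one private context per device; run_on_device and run_all are hypothetical names for illustration, not the actual MCXCL functions:

#include <CL/cl.h>
#include <pthread.h>

/* hypothetical worker: each host thread owns a private context and
   queue for exactly one device, so launches no longer go through a
   shared context that the driver could serialize */
void *run_on_device(void *arg){
    cl_device_id dev=*(cl_device_id *)arg;
    cl_int status;
    cl_context ctx=clCreateContext(NULL,1,&dev,NULL,NULL,&status);
    cl_command_queue q=clCreateCommandQueue(ctx,dev,0,&status);
    /* ... build program, create buffers/kernel, enqueue, clFinish(q) ... */
    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return NULL;
}

/* launch one thread per device, then wait for all of them */
void run_all(cl_device_id *devIDs,int devnum){
    pthread_t tid[devnum];
    for(int i=0;i<devnum;i++)
        pthread_create(tid+i,NULL,run_on_device,devIDs+i);
    for(int i=0;i<devnum;i++)
        pthread_join(tid[i],NULL);
}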
I also saw previous reports like this
so I had thought this must be a bug in NVIDIA's OpenCL driver, until I tested the multi-GPU sample code in NVIDIA's OpenCL SDK, i.e. this code
which also uses a single-context, multiple-queue structure, yet it executes its simple kernel concurrently on my Linux box (Ubuntu 16.04, driver 418.56, dual Titan V).
My code has a very similar structure, which can be summarized in the snippet below:
mcxcontext=clCreateContext(..,devnum,devIDs,..);
for(int i=0;i<devnum;i++){
    /* one queue per device within the shared context */
    mcxqueue[i]=clCreateCommandQueue(mcxcontext,devIDs[i],...);
}
for(int i=0;i<devnum;i++){
    /* one set of buffers per device */
    arg1[i]=clCreateBuffer(mcxcontext,...);
    arg2[i]=clCreateBuffer(mcxcontext,...);
    ...
}
clBuildProgram(mcxprogram,...);
for(int i=0;i<devnum;i++){
    /* one kernel object per device, bound to that device's buffers */
    mcxkernel[i]=clCreateKernel(mcxprogram,"mcx_main_loop",&status);
    clSetKernelArg(mcxkernel[i],0,sizeof(cl_mem),(void*)(arg1+i));
    clSetKernelArg(mcxkernel[i],1,sizeof(cl_mem),(void*)(arg2+i));
    ...
}
for(int i=0;i<devnum;i++){
    /* launch on every queue first, then wait for all */
    clEnqueueNDRangeKernel(mcxqueue[i],mcxkernel[i],...);
    clFlush(mcxqueue[i]);
}
for(int i=0;i<devnum;i++){
    clFinish(mcxqueue[i]);
}
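One way to verify the overlap (or lack of it) directly is to create the queues with profiling enabled and compare each kernel's start/end timestamps. A sketch continuing the snippet above; globalsize and localsize are assumed to be defined, and error checks are omitted:

/* create each queue with profiling enabled instead */
mcxqueue[i]=clCreateCommandQueue(mcxcontext,devIDs[i],CL_QUEUE_PROFILING_ENABLE,&status);

cl_event ev[devnum];
for(int i=0;i<devnum;i++){
    clEnqueueNDRangeKernel(mcxqueue[i],mcxkernel[i],1,NULL,&globalsize,&localsize,0,NULL,ev+i);
    clFlush(mcxqueue[i]);
}
for(int i=0;i<devnum;i++)
    clFinish(mcxqueue[i]);

for(int i=0;i<devnum;i++){
    cl_ulong t0,t1;
    clGetEventProfilingInfo(ev[i],CL_PROFILING_COMMAND_START,sizeof(t0),&t0,NULL);
    clGetEventProfilingInfo(ev[i],CL_PROFILING_COMMAND_END,sizeof(t1),&t1,NULL);
    /* timestamps come from each device's own clock, so treat cross-device
       comparison as approximate; if the [start,end] intervals do not
       overlap at all, the launches were serialized */
    printf("device %d: start=%llu end=%llu\n",i,(unsigned long long)t0,(unsigned long long)t1);
    clReleaseEvent(ev[i]);
}

If the intervals are strictly back-to-back on NVIDIA but interleaved on AMD, that would confirm the driver is serializing the launches.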
Can someone let me know if you notice anything in my code, compared with the oclSimpleMultiGPU.cpp example, that prevents my kernels from running concurrently? (Again, they do run concurrently on AMD GPUs.)
To show that the kernels currently run sequentially, you can run:
git clone https://github.com/fangq/mcxcl.git
cd mcxcl/src
make clean
make
../bin/mcxcl --bench cube60 -G 1 -n 1e7 # running 1e7 photons using 1st GPU
../bin/mcxcl --bench cube60 -G 11 -n 1e7 # running 1e7 photons using 1st+2nd GPUs
On an NVIDIA system with multiple GPUs, the execution time of the second command is the same as that of the first; I would expect it to be about half if the kernels executed concurrently.