I don’t think the standard defines what actually happens when you have N queues for a single device and use them concurrently (with many host threads). The only mentions I’ve found:
So, it should be possible to have many queues, and one should not assume thread safety over a single queue. But this doesn't directly imply that separate command queues are independent to the point of being thread-safe with respect to each other.
In the case of NVIDIA GPUs, the hardware queues requests from all sources (kernel launches, draw calls), even ones coming from different processes, so there's this additional layer of thread safety built in; but it's not specified that the OpenCL runtime must do this kind of queuing.
I think that for command queue creation, buffer creation and the rest (discovering cl_devices, ...) there should be no problems as long as the data are kept separate.
Let me explain myself better:
If I have 2 threads, and each one has its own context with its own command queue opened on it, BUT ON THE SAME DEVICE, there shouldn't be any problems (at least I suppose), because it is as if I had two different programs working on different data: they should (and I repeat SHOULD, because I'm supposing it and I'm not really sure of it) not interfere with each other.
The dangerous point (maybe not the only one, but the one I'm interested in) is when I blend the two executions together:
I explain this with this image:
[image: THREAD 1 and THREAD 2 side by side, each enqueueing work to the same device]
During the execution, which one goes first? The one that comes first? And if it is the one that comes first, is execution blocked for any other request until it finishes?
It seems (and I want to be clear that I'm not sure of this, it's just what I can barely see, so it's a deduction) that it's the first option, that is, a queue.
I say this only because, after the call to clEnqueueNDRangeKernel, the first thread takes about 2 s and the second, although it performs the same operation, about 4 s.
So, I may be wrong (and probably I am), but it seems that when I start more than one execution on a CPU within an OpenCL context (I say CPU because it's the only device I'm working on), it manages the requests as a FIFO queue: the first "execution request" (I'm sorry for my terminology :"> ) that comes in is the first that will be served.
I'm going to work with a GPU in a few weeks and, if I discover something else, I'll report it here.
If I have 2 threads with different command queues, each using a kernel instance unique to it, where the kernel code is the same but the data is different, will the kernel instances run concurrently on the GPU?
You mean on a single GPU? Current GPUs don't support concurrent kernel execution no matter how you play it. Fermi-based cards (which should become available in small numbers next month) will support concurrent kernels (up to four at a time), but I can't tell whether this will immediately work in OpenCL or only in CUDA.