Multithreading and OpenCL: what really happens if...

Hi everyone, I’m running some tests with OpenCL and multithreading.

I’m using the Pthreads library and I have a question for all of you.

What really happens when I have two threads running concurrently, each with its own data, but on the same OpenCL device (a CPU, for example)?

I get no errors and the data are correct (I validate the results after the OpenCL call, just to be sure).

My question is about what really happens in the architecture: memory allocation, command queue creation, and all the other setup run in parallel without problems (each thread takes care of its own).

But during execution I have opened N command queues (one per thread, with N threads running concurrently) on the same OpenCL device.

What happens on the device? Does it finish one computation at a time, or does it take all the computations and run them concurrently?

As far as I know, it should take ONLY ONE COMPUTATION AT A TIME, and when it has finished one computation it takes the next, but I would like confirmation from you.
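
For what it’s worth, within a single command queue my understanding is that in-order execution is the default unless you explicitly ask for out-of-order mode when creating the queue; what happens across several queues is the part I’m unsure about. A minimal sketch (assuming `context` and `device` already exist; `make_queue` is just my own helper name):

    #include <CL/cl.h>

    /* In-order execution is the default for a single queue: commands
     * run one after another. Passing the out-of-order flag only
     * PERMITS the implementation to overlap them; it is no guarantee. */
    static cl_command_queue make_queue(cl_context context,
                                       cl_device_id device,
                                       int out_of_order)
    {
        cl_int err;
        cl_command_queue_properties props =
            out_of_order ? CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE : 0;
        cl_command_queue q = clCreateCommandQueue(context, device,
                                                  props, &err);
        return (err == CL_SUCCESS) ? q : NULL;
    }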

Thank you, Paolo.

I don’t think the standard defines what actually happens when you have N queues for a single device and use them concurrently (from many host threads). The only mentions I’ve found:

and

So it should be possible to have many queues, and one should not assume thread-safety over a single queue. But this doesn’t directly imply that separate command queues are independent to the point of being safe to use concurrently.
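
So if several host threads ever had to share one queue, the safe reading of the spec would be to serialize access yourself. A rough sketch of what I mean, with a hypothetical shared queue and a pthread mutex guarding it:

    #include <pthread.h>
    #include <CL/cl.h>

    /* Hypothetical shared state: one queue used by several host threads. */
    static cl_command_queue shared_queue;
    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Serialize enqueues ourselves rather than assuming the runtime
     * makes a single queue safe to use from multiple threads. */
    static cl_int enqueue_locked(cl_kernel kernel, size_t global_size)
    {
        pthread_mutex_lock(&queue_lock);
        cl_int err = clEnqueueNDRangeKernel(shared_queue, kernel, 1,
                                            NULL, &global_size, NULL,
                                            0, NULL, NULL);
        pthread_mutex_unlock(&queue_lock);
        return err;
    }

With one queue per thread, as in the original question, no such lock should be needed for the enqueue calls themselves.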

In the case of NVIDIA GPUs, the hardware queues requests from all sources (kernels, draw calls), even those coming from different processes, so there’s this additional layer of thread-safety built in; but it’s not specified that the OpenCL runtime must do this kind of queuing.

I wonder how it would work when run on a CPU.

Thanks for your reply.

I think that for command queue creation, buffer creation, and the rest (discovering CL devices, …) there should be no problems as long as the data are separate.

Let me explain myself better:

If I have two threads and each one has its own context, on which it has opened its own command queue BUT ON THE SAME DEVICE, there shouldn’t be any problems (at least I suppose), because it is as if I have two different programs; with different data they should (and I repeat SHOULD, because I’m supposing it and I’m not really sure of it) not interfere with each other.

The only dangerous point (maybe not the only one, but the one I’m interested in) is when I blend the executions together:

Let me explain with an image:

THREAD 1                THREAD 2

mycontext(CPU)          mycontext(CPU)
mycommandqueue          mycommandqueue
mykernelcompilation     mykernelcompilation
mymemoryobjects         mymemoryobjects
enqueueNDRangeKernel    enqueueNDRangeKernel

           ???
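
In code, each thread does roughly the following (a simplified sketch; error handling, kernel compilation, and the kernel itself are omitted, and `struct work` is just an illustrative name):

    #include <pthread.h>
    #include <CL/cl.h>

    /* Per-thread work: own context, own queue, own buffers -- but the
     * SAME cl_device_id shared by every thread. */
    struct work { cl_device_id device; float *data; size_t n; };

    static void *thread_main(void *arg)
    {
        struct work *w = arg;
        cl_int err;

        /* mycontext(CPU) */
        cl_context ctx = clCreateContext(NULL, 1, &w->device,
                                         NULL, NULL, &err);
        /* mycommandqueue */
        cl_command_queue q = clCreateCommandQueue(ctx, w->device, 0, &err);
        /* mymemoryobjects */
        cl_mem buf = clCreateBuffer(ctx,
                                    CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    w->n * sizeof(float), w->data, &err);

        /* mykernelcompilation and enqueueNDRangeKernel would go here;
         * what the device does when N threads reach this point at the
         * same time is exactly the open question. */

        clReleaseMemObject(buf);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        return NULL;
    }

Each thread would then be launched with something like pthread_create(&tid, NULL, thread_main, &w[i]).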

During the execution, which one goes first? The one that comes first? And if it is the one that comes first, is the execution blocked for any other request?

Let me explain with another image. Is it like a single queue:

ThreadN → ThreadN-1 → … → Thread2 → Thread1 → Thread0 → CPU

or

ThreadN    →
ThreadN-1  →
…          →   CPU
Thread2    →
Thread1    →
Thread0    →

It seems (and I want to be clear that I’m not sure of this; it is only what I can observe, so it’s a deduction) that it’s like the first option, that is, a queue.

I say this only because, after the call to clEnqueueNDRangeKernel, the first thread takes about 2 s and the second, although it performs the same operation, about 4 s.

So, I may be wrong (and probably I am), but it seems that when I start more than one execution on a CPU within an OpenCL context (I say CPU because it’s the only device I’m working on), it manages the requests as a FIFO queue: the first “execution request” (sorry for my terminology) that arrives is the first to be served.
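
One way I could verify this (I haven’t tried it yet) would be to enable profiling on each queue and compare the device-side timestamps with the wall-clock times: if both kernels report roughly 2 s of device execution but the second one finishes 2 s later, the device really is serializing them. A sketch, assuming the queue was created with CL_QUEUE_PROFILING_ENABLE:

    #include <CL/cl.h>

    /* Device-side execution time of one enqueue, in milliseconds.
     * The queue must have been created with CL_QUEUE_PROFILING_ENABLE. */
    static double kernel_ms(cl_command_queue q, cl_kernel k, size_t gsize)
    {
        cl_event ev;
        cl_ulong t0, t1;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, NULL, 0, NULL, &ev);
        clWaitForEvents(1, &ev);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof t0, &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                sizeof t1, &t1, NULL);
        clReleaseEvent(ev);
        return (double)(t1 - t0) * 1e-6; /* ns -> ms */
    }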

I’m going to work with a GPU in a few weeks and, if I discover something else, I’ll report it here.

Thanks again to Big_Mac and to everyone.

Paolo.

A related question I have:

If I have two threads with different command queues, each using its own kernel instance (the kernel code is the same but the data are different), will the kernel instances run concurrently on the GPU?

You mean on a single GPU? Current GPUs don’t support concurrent kernel execution no matter how you play it. Fermi-based cards (which should become available in small numbers next month) will support concurrent kernels (up to four at a time), but I can’t tell whether this will immediately work in OpenCL or only in CUDA.

Still not sure about the thread-safety thing.