What is the general technique when you want multiple devices (in the same context) to run the same kernel on the same memory? How do you split the workload? (For example, I want the first device to calculate the first half of the job and the second device the second half, in the same memory.)
The global_work_offset parameter of clEnqueueNDRangeKernel would be really handy for that, but it currently isn't supported (OpenCL 1.0 requires it to be NULL).
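To make the idea concrete, here is a minimal sketch of what that split would look like once implementations accept a non-NULL offset (queue0, queue1, kernel, and N are illustrative names, not from this thread):

/* Hypothetical split of N work-items across two devices, relying on a
   working global_work_offset. Each device gets one contiguous half. */
size_t half    = N / 2;
size_t offset0 = 0,    count0 = half;
size_t offset1 = half, count1 = N - half;

clEnqueueNDRangeKernel(queue0, kernel, 1, &offset0, &count0, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue1, kernel, 1, &offset1, &count1, NULL, 0, NULL, NULL);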
There is an oclSimpleMultiGPU example that comes with the SDK. That example does it the way you describe.
I have chosen a multi-context solution, where each GPU runs in its own context. This method is good for me because I have enough work to run thousands of large-worksize kernels and do not care how long an individual execution takes (response time). I also have the advantage of running from Java, which has its own CPU-based blocking-queue/thread-pool API. I sub-class this and have one thread per GPU context. This way each GPU can work at its own pace, picking one unit of work off the single Java queue. I am almost done with this. It is good for machines like the MacBook Pro, where the GPUs have a different number of processors and different clock speeds.
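In C (rather than the Java I'm actually using), a bare-bones sketch of that pattern could look like the following; run_unit_on_gpu and every other identifier are illustrative stand-ins, not code from this thread:

/* One host thread per GPU context; an atomic counter stands in for the
   blocking work queue. Each thread pulls the next unit, runs it to
   completion on its GPU, and comes back for more, so faster GPUs
   naturally take more units. */
#include <CL/cl.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_WORK_UNITS 1000

typedef struct {
    cl_context       context;  /* one context per GPU */
    cl_command_queue queue;    /* that GPU's command queue */
    cl_kernel        kernel;   /* kernel built in that context */
} gpu_worker;

static atomic_int next_unit;

/* Hypothetical helper: sets per-unit args, enqueues, and blocks with
   clFinish so this thread only grabs new work when its GPU is free. */
extern void run_unit_on_gpu(gpu_worker *w, int unit);

static void *worker_main(void *arg) {
    gpu_worker *w = (gpu_worker *)arg;
    for (;;) {
        int unit = atomic_fetch_add(&next_unit, 1);
        if (unit >= NUM_WORK_UNITS)
            break;                 /* queue drained */
        run_unit_on_gpu(w, unit);
    }
    return NULL;
}

Each gpu_worker would be handed to pthread_create(&tid, NULL, worker_main, &workers[i]) at startup, one thread per GPU context.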
Either way, be prepared for multiple instances of the same kernel. You cannot specify parameters to a kernel on a per-command-queue basis, which is why I am doing context pooling, not command-queue pooling. You are also going to need multiple copies of some memory. The SDK sample code has multiples for kernels and for all memory. I am not totally sure why they thought they needed a kernel per GPU; they are not even passing parameters.
I recommend modifying the SDK sample to use only one kernel object on a multi-GPU machine before concluding it is possible.
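For what it's worth, if you do want one argument set per queue without separate contexts, a sketch of the alternative (identifiers illustrative, a single shared context assumed) is to build the program once and create one kernel object per queue, since each cl_kernel carries its own argument values:

/* One context spanning both GPUs, one program built once, but a
   separate kernel object per command queue so each can hold its own
   argument values independently. */
cl_int     err;
cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

cl_kernel kernel0 = clCreateKernel(program, "Function", &err); /* queue0 / GPU 0 */
cl_kernel kernel1 = clCreateKernel(program, "Function", &err); /* queue1 / GPU 1 */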
I must say I'm a bit disappointed in OpenCL, since this sounds like a total waste of memory. (I'm doing calculations on millions of particles with lots of properties.)
Is it possible to pass the same cl_mem to different kernels and let multiple devices use it at the same time? (I don't care how OpenCL manages the memory.) Then I could do something like:
__kernel void Function(int min_id, int max_id, ...)
{
    int id = get_global_id(0);
    if (id < min_id || id > max_id)
        return;
    ...
}
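Host side, that guard would be driven something like this (a sketch assuming the two kernel objects from the same program, as above; num_particles and the queue names are placeholders):

/* Launch the full range on both devices; each kernel instance is
   masked to its own half via min_id/max_id. */
size_t global = num_particles;
int    half   = (int)(num_particles / 2);
int    min0   = 0,    max0 = half - 1;
int    min1   = half, max1 = (int)num_particles - 1;

clSetKernelArg(kernel0, 0, sizeof(int), &min0);
clSetKernelArg(kernel0, 1, sizeof(int), &max0);
clSetKernelArg(kernel1, 0, sizeof(int), &min1);
clSetKernelArg(kernel1, 1, sizeof(int), &max1);

clEnqueueNDRangeKernel(queue0, kernel0, 1, NULL, &global, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue1, kernel1, 1, NULL, &global, NULL, 0, NULL, NULL);

Note that both devices still enumerate the full range; the masked-off work-items just return immediately, which is the cost of not having global_work_offset.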
What I think you might be missing is that each GPU has its own on-card memory. You would never want to use host memory as direct input to a kernel. That said, the data has to exist in host memory (the PC's memory) long enough to be copied to each device's memory. Yes, with a single-context, multiple-command-queue scenario it should be possible to use only one cl_mem for input, but I never tried it. If you specify CL_MEM_COPY_HOST_PTR when you create it, it should be copied to each command queue (read: device) in the context. If you do not, then make sure to copy it to each command queue yourself.
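In code, that would be roughly the following (a sketch; ctx, bytes, and host_ptr are illustrative, and whether the runtime really copies per device is the open question above):

/* One cl_mem handle in a context that spans both devices. With
   CL_MEM_COPY_HOST_PTR the host data is copied into the buffer at
   creation time. */
cl_mem input = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              bytes, host_ptr, &err);

/* Alternatively, without CL_MEM_COPY_HOST_PTR, push the data on each
   queue yourself, as suggested above:
   clEnqueueWriteBuffer(queue0, input, CL_FALSE, 0, bytes, host_ptr, 0, NULL, NULL);
   clEnqueueWriteBuffer(queue1, input, CL_FALSE, 0, bytes, host_ptr, 0, NULL, NULL); */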
For output globals, it looks like you could go with one cl_mem as well, but reading a single cl_mem from device to host memory would definitely require specifying offsets and sizes into host memory, to avoid writing to the same part of memory from different command queues.
Breaking up a worksize between multiple devices requires synchronization, so I do not like it or use it. It would be really nice if a book were written on this topic, or on OpenCL period, but don't count on it from me.
I correct myself: the offset and size correspond to the cl_mem, not the host memory. You would need to adjust the address in host memory and pass the correct size to keep from overwriting the same spot. Depending on how you wrote your kernel, you might still need a non-zero offset too.
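Put into code, that correction looks like this (a sketch; half_bytes, total_bytes, host_out, and the queue names are placeholders):

/* Each queue reads back only its own half of the single output buffer.
   The offset parameter indexes into the cl_mem; the host pointer must
   be advanced separately so the two reads land in different places. */
clEnqueueReadBuffer(queue0, output, CL_TRUE,
                    0, half_bytes,
                    host_out, 0, NULL, NULL);
clEnqueueReadBuffer(queue1, output, CL_TRUE,
                    half_bytes, total_bytes - half_bytes,
                    (char *)host_out + half_bytes, 0, NULL, NULL);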
Once you expand beyond one device, you need to manage a lot more yourself. The question is: Do you really want to pay the development price, or is 1 GPU enough?
I know that all devices need to allocate their own memory; it's the host memory I'm worried about. E.g., the problem of paging large amounts of memory to devices when you keep multiple identical buffers (n * some amount of host memory is wasted); it makes a difference when a buffer is 2 GB.
It's no problem for me to manage offsets when reading output buffers; I just need to know whether OpenCL is smart enough to manage one read/write buffer across all devices. I promise every device will write only to its own spots in the buffer!
I think it's testing and benchmarking time.