What is the general technique when you want multiple devices (in the same context) to run the same kernel on the same memory? How do you split the workload? (For example, I want the first device to calculate the first half of the job and the second device the second half, in the same memory.)
The global_work_offset parameter of clEnqueueNDRangeKernel would be really handy for that, but it currently isn't supported (OpenCL 1.0 requires it to be NULL).
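To make the idea concrete, here is a minimal sketch of what that split would look like once implementations accept a non-NULL offset (queue0, queue1, kernel, and N are illustrative names, not from this thread):

/* Hypothetical split of N work-items across two devices, relying on a
   working global_work_offset. Each device gets one contiguous half. */
size_t half    = N / 2;
size_t offset0 = 0,    count0 = half;
size_t offset1 = half, count1 = N - half;

clEnqueueNDRangeKernel(queue0, kernel, 1, &offset0, &count0, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue1, kernel, 1, &offset1, &count1, NULL, 0, NULL, NULL);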
There is an oclSimpleMultiGPU example that comes with the SDK. That example does it the way you describe.
I have chosen a multi-context solution, where each GPU runs in its own context. This method is good for me because I have enough work to run thousands of large-worksize kernels and do not care how long an individual execution takes (response time). I also have the advantage of running from Java, which has its own CPU-based blocking-queue/thread-pool API. I sub-class this and have one thread per GPU context. This way each GPU can work at its own pace, picking one unit of work off the single Java queue. I am almost done with this. It is good for machines like the MacBook Pro, where the GPUs have a different number of processors and different clock speeds.
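In C (rather than the Java I'm actually using), a bare-bones sketch of that pattern could look like the following; run_unit_on_gpu and every other identifier are illustrative stand-ins, not code from this thread:

/* One host thread per GPU context; an atomic counter stands in for the
   blocking work queue. Each thread pulls the next unit, runs it to
   completion on its GPU, and comes back for more, so faster GPUs
   naturally take more units. */
#include <CL/cl.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_WORK_UNITS 1000

typedef struct {
    cl_context       context;  /* one context per GPU */
    cl_command_queue queue;    /* that GPU's command queue */
    cl_kernel        kernel;   /* kernel built in that context */
} gpu_worker;

static atomic_int next_unit;

/* Hypothetical helper: sets per-unit args, enqueues, and blocks with
   clFinish so this thread only grabs new work when its GPU is free. */
extern void run_unit_on_gpu(gpu_worker *w, int unit);

static void *worker_main(void *arg) {
    gpu_worker *w = (gpu_worker *)arg;
    for (;;) {
        int unit = atomic_fetch_add(&next_unit, 1);
        if (unit >= NUM_WORK_UNITS)
            break;                 /* queue drained */
        run_unit_on_gpu(w, unit);
    }
    return NULL;
}

Each gpu_worker would be handed to pthread_create(&tid, NULL, worker_main, &workers[i]) at startup, one thread per GPU context.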
Either way, be prepared for multiple instances of the same kernel. You cannot specify parameters to a kernel on a per-command-queue basis, which is why I am doing context pooling, not command-queue pooling. You are also going to need multiple copies of some memory. The SDK sample code has multiples for kernels and for all memory. I am not totally sure why they thought they needed a kernel per GPU; they are not even passing parameters.
I recommend modifying the SDK sample to use only one kernel object on a multi-GPU machine before concluding it is possible.
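For what it's worth, if you do want one argument set per queue without separate contexts, a sketch of the alternative (identifiers illustrative, a single shared context assumed) is to build the program once and create one kernel object per queue, since each cl_kernel carries its own argument values:

/* One context spanning both GPUs, one program built once, but a
   separate kernel object per command queue so each can hold its own
   argument values independently. */
cl_int     err;
cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

cl_kernel kernel0 = clCreateKernel(program, "Function", &err); /* queue0 / GPU 0 */
cl_kernel kernel1 = clCreateKernel(program, "Function", &err); /* queue1 / GPU 1 */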
I must say I'm a bit disappointed in OpenCL, since this sounds like a total waste of memory. (I'm doing calculations on millions of particles with lots of properties.)
Is it possible to pass the same cl_mem to different kernels and let multiple devices use it at the same time? (I don't care how OpenCL manages the memory.) Then I could do something like:
__kernel void Function(int min_id, int max_id, ...)
{
    int id = get_global_id(0);
    if (id < min_id || id > max_id)
        return;
    ...
}
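Host side, that guard would be driven something like this (a sketch assuming the two kernel objects from the same program, as above; num_particles and the queue names are placeholders):

/* Launch the full range on both devices; each kernel instance is
   masked to its own half via min_id/max_id. */
size_t global = num_particles;
int    half   = (int)(num_particles / 2);
int    min0   = 0,    max0 = half - 1;
int    min1   = half, max1 = (int)num_particles - 1;

clSetKernelArg(kernel0, 0, sizeof(int), &min0);
clSetKernelArg(kernel0, 1, sizeof(int), &max0);
clSetKernelArg(kernel1, 0, sizeof(int), &min1);
clSetKernelArg(kernel1, 1, sizeof(int), &max1);

clEnqueueNDRangeKernel(queue0, kernel0, 1, NULL, &global, NULL, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue1, kernel1, 1, NULL, &global, NULL, 0, NULL, NULL);

Note that both devices still enumerate the full range; the masked-off work-items just return immediately, which is the cost of not having global_work_offset.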
What I think you might be missing is that each GPU has its own on-card memory. You would never want to use host memory as direct input to a kernel. That said, the data has to exist in host memory (the PC's memory) long enough to be copied to each device's memory. Yes, with a single-context, multiple-command-queue scenario it should be possible to use only one cl_mem for input, but I never tried it. If you specify CL_MEM_COPY_HOST_PTR when you create it, it should be copied to each command queue (read: device) in the context. If you do not, then make sure to copy it to each command queue yourself.
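In code, that would be roughly the following (a sketch; ctx, bytes, and host_ptr are illustrative, and whether the runtime really copies per device is the open question above):

/* One cl_mem handle in a context that spans both devices. With
   CL_MEM_COPY_HOST_PTR the host data is copied into the buffer at
   creation time. */
cl_mem input = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              bytes, host_ptr, &err);

/* Alternatively, without CL_MEM_COPY_HOST_PTR, push the data on each
   queue yourself, as suggested above:
   clEnqueueWriteBuffer(queue0, input, CL_FALSE, 0, bytes, host_ptr, 0, NULL, NULL);
   clEnqueueWriteBuffer(queue1, input, CL_FALSE, 0, bytes, host_ptr, 0, NULL, NULL); */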
For output globals, it looks like you could go with one cl_mem as well, but reading a single cl_mem from device to host memory would definitely require specifying offsets and sizes into host memory, to avoid writing to the same part of memory from different command queues.
Breaking up a worksize between multiple devices requires synchronization, so I do not like it or use it. It would be really nice if a book were written on this topic, or on OpenCL period, but don't count on it from me.
I correct myself: the offset and size correspond to the cl_mem, not the host memory. You would need to adjust the address in host memory and pass the correct size to keep from overwriting the same spot. Depending on how you wrote your kernel, you might still need a non-zero offset too.
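Put into code, that correction looks like this (a sketch; half_bytes, total_bytes, host_out, and the queue names are placeholders):

/* Each queue reads back only its own half of the single output buffer.
   The offset parameter indexes into the cl_mem; the host pointer must
   be advanced separately so the two reads land in different places. */
clEnqueueReadBuffer(queue0, output, CL_TRUE,
                    0, half_bytes,
                    host_out, 0, NULL, NULL);
clEnqueueReadBuffer(queue1, output, CL_TRUE,
                    half_bytes, total_bytes - half_bytes,
                    (char *)host_out + half_bytes, 0, NULL, NULL);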
Once you expand beyond one device, you need to manage a lot more yourself. The question is: Do you really want to pay the development price, or is 1 GPU enough?
I know that all devices need to allocate their own memory; it's the host memory I'm worried about. E.g., the problem of paging large amounts of memory to devices when you keep multiple identical buffers (n * some amount of host memory is wasted); it makes a difference when a buffer is 2 GB.
It's no problem for me to manage offsets when reading output buffers; I just need to know whether OpenCL is smart enough to manage one read/write buffer across all devices. I promise every device will write only to its own spots in the buffer!
I think it's testing and benchmarking time.