What I think you might be missing is that each GPU has its own on-card memory. You would never want to use host memory as direct input to a kernel. That said, the data has to exist in host memory (the PC’s memory) long enough to be copied to each device’s memory. Yes, with a single-context, multiple-command-queue scenario, it should be possible to use only one cl_mem for input, but I never tried it. If you specify CL_MEM_COPY_HOST_PTR when you create it, then it should be copied to each command queue (read: device) in the context. If you do not, then make sure to copy it to each command queue yourself (e.g. with clEnqueueWriteBuffer).
For output globals, it looks like you could go with one cl_mem as well, but reading a single cl_mem from device to host memory would definitely require specifying offsets and sizes into host memory, so that different command queues do not write to the same part of the host buffer.
Breaking up a work size between multiple devices requires synchronization, so I do not like it or use it. It would be really nice if a book were written on this topic (maybe an OpenCL book, period), but don’t count on it from me.