OpenCL matrix multiplication on multiple devices

Hi all. I’m very new to the forums and also to OpenCL.

I’ve implemented a simple multiplication of two matrices that works great on the two platforms (NVIDIA and AMD) and three devices at my disposal (a GTX460, an AMD graphics card and an Intel Core Quad CPU).

However, my example was written to run on only one device at a time.

What I am really looking for is a guideline that will help me change my implementation so that I can divide the workload between my two graphics cards.
Can someone help me, or point me to some examples?

Thank you very much.