Programming on Multiple GPUs and Multiple CPUs


In our cluster, each unit has 4 GPUs and two 8-core AMD CPUs, with 48 GB of main RAM plus an additional 6 GB of memory per GPU.
To start with, I want to test a reduction kernel on a single unit; later I will extend it to the entire cluster.
The reduction kernel should exploit all the resources of the unit, namely all 4 GPUs and both AMD CPUs.
I know multi-GPU programming in CUDA is possible, but I do not understand how I can actually use the two AMD CPUs with CUDA.
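
To make the goal concrete, here is a rough sketch of the partitioning I have in mind. It is plain C++ with host threads only and contains no CUDA calls. The names partial_sum and partitioned_sum are hypothetical, and every worker runs on the CPU here; on the real machine I assume some of the workers would instead hand their slice to one of the GPUs.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums data[begin, end) on the host.
// Assumption: a GPU worker would replace this body with a copy of its
// slice to one device followed by a reduction kernel on that device.
double partial_sum(const std::vector<double>& data,
                   std::size_t begin, std::size_t end) {
    return std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
}

// Splits the input into `workers` contiguous slices, reduces each slice
// on its own host thread, then combines the partial results on the host.
double partitioned_sum(const std::vector<double>& data, std::size_t workers) {
    assert(workers > 0);
    std::vector<double> partial(workers, 0.0);
    std::vector<std::thread> threads;
    const std::size_t n = data.size();

    for (std::size_t w = 0; w < workers; ++w) {
        // Slice boundaries; slices may be empty if workers > n.
        const std::size_t begin = n * w / workers;
        const std::size_t end = n * (w + 1) / workers;
        threads.emplace_back([&data, &partial, w, begin, end] {
            // Each thread writes only its own slot, so no locking is needed.
            partial[w] = partial_sum(data, begin, end);
        });
    }
    for (auto& t : threads) {
        t.join();
    }
    // Final combine step over the per-worker partial sums.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

For one unit I would call something like partitioned_sum(data, 20), i.e. one worker per GPU plus one per CPU core, although an even split is probably not the right balance between GPUs and CPU cores.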

I would really appreciate it if someone could point me to a resource that explains how to program multiple multicore CPUs (and multiple GPUs) with CUDA.