Using Host memory for multiple GPUs

Hi,
I am trying to run a program where I have 4 GPUs running different kernels. All kernels use data from a single host array ‘ARR’. I have 2 options:

  1. I allocate ‘ARR’ on each device (cudaMalloc), copy the data over (cudaMemcpy), and free it afterwards.
  2. Or I use cudaHostRegister to register ARR (page-aligned), run my kernels, and then unregister the array.

I am using the 2nd approach, but I am not getting the right answer. I am doing:

Register array ‘ARR’

cudaSetDevice(0);
run kernel 1

cudaSetDevice(1);
run kernel 2

cudaSetDevice(2);
run kernel 3

cudaSetDevice(3);
run kernel 4

Unregister(ARR);

Can you please tell me what I am doing wrong? Must I register the array ‘ARR’ with all devices first, run the kernels, and then unregister it from all devices?

Thanks,
Vikram.

You need to register the memory with the cudaHostRegisterPortable flag.
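For reference, here is a minimal sketch of that approach (not the original code; the kernel, array size, and page-size value are placeholders, and error checking is omitted). It registers one host array with cudaHostRegisterPortable so the pinned registration is visible to every device, adds cudaHostRegisterMapped so kernels can read the host memory directly on a 64-bit system with unified virtual addressing (UVA), launches a kernel on each GPU, synchronizes all devices, and only then unregisters:

// A minimal sketch of the registered-host-memory approach, assuming a 64-bit
// system with UVA; the kernel, array size, and 4096-byte page size are
// placeholders, and error checking is omitted for brevity.
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void useArr(const float *arr, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = arr[i];   // each GPU reads the registered host array directly
        (void)v;
    }
}

int main(void)
{
    const int N = 1 << 20;
    float *ARR = NULL;

    // Page-aligned host allocation, as required for cudaHostRegister.
    posix_memalign((void **)&ARR, 4096, N * sizeof(float));

    // Portable: the pinned registration is visible to every CUDA context/device.
    // Mapped: lets kernels access the host memory directly (zero-copy) under UVA.
    cudaHostRegister(ARR, N * sizeof(float),
                     cudaHostRegisterPortable | cudaHostRegisterMapped);

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Launch one kernel per GPU; with UVA the host pointer can be passed as-is.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        useArr<<<(N + 255) / 256, 256>>>(ARR, N);
    }

    // Make sure every GPU has finished before unregistering the memory.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }

    cudaHostUnregister(ARR);
    free(ARR);
    return 0;
}

Without the portable flag, the registration applies only to the device context that was current when cudaHostRegister was called, so the kernels running on the other devices do not see the memory as pinned and registered.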