Hi,
I am trying to run a program in which 4 GPUs each run a different kernel. All kernels use data from a single array ‘ARR’. I see two options:
- cudaMalloc a copy of ‘ARR’ on each device, cudaMemcpy the data into it, and free the copies when the kernels are done.
- Or register ‘ARR’ (page-aligned) with cudaHostRegister, run my kernels, and then unregister it with cudaHostUnregister.
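A minimal sketch of the first option, assuming ‘ARR’ is a float array of n elements. `scaleKernel`, `runWithCopies`, the `CUDA_TRY` macro, and the per-device scale factor d + 1 are hypothetical stand-ins for the real kernels. Each GPU gets its own device copy of ARR, and every kernel is launched before any result is read back, so one device's kernel can run while the next device is being set up.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Return early from the enclosing function if a CUDA call fails.
// (Sketch only: buffers allocated before the failure are not freed.)
#define CUDA_TRY(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) return e_; } while (0)

// Hypothetical stand-in for "kernel 1..4": writes in[i] * factor to out[i].
__global__ void scaleKernel(const float *in, float *out, size_t n, float factor)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Option 1: every GPU gets its own device copy of ARR.
// results[d] receives device d's output. Requires n > 0.
cudaError_t runWithCopies(const float *arr, size_t n,
                          std::vector<std::vector<float>> &results)
{
    int count = 0;
    CUDA_TRY(cudaGetDeviceCount(&count));   // valid ordinals are 0 .. count-1
    const size_t bytes = n * sizeof(float);
    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    std::vector<float *> dIn(count, nullptr), dOut(count, nullptr);
    results.assign(count, std::vector<float>(n));

    // Copy ARR to each device and launch; the launch returns immediately.
    for (int d = 0; d < count; ++d) {
        CUDA_TRY(cudaSetDevice(d));
        CUDA_TRY(cudaMalloc((void **)&dIn[d], bytes));
        CUDA_TRY(cudaMalloc((void **)&dOut[d], bytes));
        CUDA_TRY(cudaMemcpy(dIn[d], arr, bytes, cudaMemcpyHostToDevice));
        scaleKernel<<<blocks, threads>>>(dIn[d], dOut[d], n, (float)(d + 1));
        CUDA_TRY(cudaGetLastError());
    }
    // cudaMemcpy on the default stream waits for that device's kernel.
    for (int d = 0; d < count; ++d) {
        CUDA_TRY(cudaSetDevice(d));
        CUDA_TRY(cudaMemcpy(results[d].data(), dOut[d], bytes, cudaMemcpyDeviceToHost));
        CUDA_TRY(cudaFree(dIn[d]));
        CUDA_TRY(cudaFree(dOut[d]));
    }
    return cudaSuccess;
}
```

This costs one full host-to-device copy of ARR per GPU, but afterwards each kernel reads ARR from its own device memory instead of over the PCIe bus.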
I am using the second approach, but I am not getting the right answer.
I am doing:
Register array ‘ARR’
cudaSetDevice(0);
run kernel 1
cudaSetDevice(1);
run kernel 2
cudaSetDevice(3);
run kernel 3
cudaSetDevice(4);
run kernel 4
Unregister(ARR);
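A minimal sketch of the second option arranged so that every device can use the registered buffer, reusing the hypothetical `scaleKernel` and `CUDA_TRY` names (`runWithRegistered` is hypothetical too). It assumes a 64-bit system with unified virtual addressing, where the runtime enables host-memory mapping by default, and a CUDA release that accepts unaligned pointers in cudaHostRegister. Two things differ from the sequence above. ARR is registered once with cudaHostRegisterPortable | cudaHostRegisterMapped, which asks for it to be pinned and device-accessible for all devices rather than only the one that is current when it is registered. Every device is also synchronized before cudaHostUnregister, because a kernel launch returns before the kernel finishes. The loop uses ordinals 0 to count - 1; on a 4-GPU machine, cudaSetDevice(4) returns cudaErrorInvalidDevice, so checking return codes is worthwhile.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Return early from the enclosing function if a CUDA call fails.
// (Sketch only: resources acquired before the failure are not released.)
#define CUDA_TRY(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) return e_; } while (0)

// Hypothetical stand-in for "kernel 1..4": writes in[i] * factor to out[i].
__global__ void scaleKernel(const float *in, float *out, size_t n, float factor)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Option 2: register ARR once, map it into every device, and wait for ALL
// devices before unregistering. results[d] gets device d's output; n > 0.
cudaError_t runWithRegistered(float *arr, size_t n,
                              std::vector<std::vector<float>> &results)
{
    int count = 0;
    CUDA_TRY(cudaGetDeviceCount(&count));
    const size_t bytes = n * sizeof(float);
    const unsigned threads = 256;
    const unsigned blocks = (unsigned)((n + threads - 1) / threads);
    std::vector<float *> dOut(count, nullptr);
    results.assign(count, std::vector<float>(n));

    // Register ONCE. Portable: pinned for every device, not just the current
    // one. Mapped: kernels can read it through a device pointer.
    CUDA_TRY(cudaHostRegister(arr, bytes,
                              cudaHostRegisterPortable | cudaHostRegisterMapped));

    for (int d = 0; d < count; ++d) {
        CUDA_TRY(cudaSetDevice(d));
        float *dArr = nullptr;  // device-side alias of ARR, queried per device
        CUDA_TRY(cudaHostGetDevicePointer((void **)&dArr, arr, 0));
        CUDA_TRY(cudaMalloc((void **)&dOut[d], bytes));
        scaleKernel<<<blocks, threads>>>(dArr, dOut[d], n, (float)(d + 1));
        CUDA_TRY(cudaGetLastError());
    }
    // Launches are asynchronous: without this loop, ARR could be unpinned
    // while kernels on some devices are still reading it.
    for (int d = 0; d < count; ++d) {
        CUDA_TRY(cudaSetDevice(d));
        CUDA_TRY(cudaDeviceSynchronize());
    }
    CUDA_TRY(cudaHostUnregister(arr));

    for (int d = 0; d < count; ++d) {
        CUDA_TRY(cudaSetDevice(d));
        CUDA_TRY(cudaMemcpy(results[d].data(), dOut[d], bytes, cudaMemcpyDeviceToHost));
        CUDA_TRY(cudaFree(dOut[d]));
    }
    return cudaSuccess;
}
```

Kernels launched this way read ARR over the PCIe bus on every access (zero-copy), so this avoids the per-device copies of the first option but can be slower if the kernels read ARR many times.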
Can you please tell me what I am doing wrong? Must I register ‘ARR’ with every device first, run the kernels, and then unregister it from every device?
Thanks,
Vikram.