I am trying to run a program in which 4 GPUs each run a different kernel. All kernels use data from a single host array ‘ARR’. I have 2 options:
- I cudaMalloc and cudaMemcpy the array ‘ARR’ on each of the devices, run my kernels, and then free the copies.
- Or I use cudaHostRegister to register ARR (page-aligned), run my kernels, and then unregister the array.
I am using the 2nd approach, but I am not getting the right answer.
I am doing:
Register array ‘ARR’
run kernel 1
run kernel 2
run kernel 3
run kernel 4
Can you please tell me what I am doing wrong? Must I register the array ‘ARR’ with all devices first, run the kernels, and then unregister it from all devices?
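
For concreteness, here is a minimal sketch of how I understand the 2nd approach might look across several devices. It is not my real code: the kernel `scale_kernel`, the helper `run_on_all_devices`, the `CUDA_TRY` macro, the array size and the 4096-byte page size are all placeholders. It assumes that a single registration with `cudaHostRegisterPortable | cudaHostRegisterMapped` is what makes the pinned array usable from every device, and that each device should get its own device pointer through `cudaHostGetDevicePointer`. Older toolkits may also need `cudaSetDeviceFlags(cudaDeviceMapHost)` on each device before any context is created; that is left out here.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Print the CUDA error and leave the enclosing function with -1.
// (Sketch only: buffers are not released on the error path.)
#define CUDA_TRY(call)                                              \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            return -1;                                              \
        }                                                           \
    } while (0)

// Hypothetical stand-in for kernels 1..4: reads ARR, writes a scaled copy.
__global__ void scale_kernel(const float *arr, float *out, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = arr[i] * factor;
}

// Returns the number of devices whose result matched, or -1 on failure.
// n is assumed to be positive.
int run_on_all_devices(int n)
{
    int ndev = 0;
    CUDA_TRY(cudaGetDeviceCount(&ndev));

    // Page-aligned host array (4096-byte pages are an assumption).
    const size_t bytes = (size_t)n * sizeof(float);
    float *ARR = nullptr;
    if (posix_memalign((void **)&ARR, 4096, bytes) != 0) return -1;
    for (int i = 0; i < n; ++i) ARR[i] = 1.0f;

    // Register once; the Portable flag asks for the pinning to be visible
    // to all CUDA contexts, the Mapped flag for a device-side mapping.
    CUDA_TRY(cudaHostRegister(ARR, bytes,
                              cudaHostRegisterPortable | cudaHostRegisterMapped));

    // Launch one kernel per device without waiting in between.
    std::vector<float *> d_out(ndev, nullptr);
    for (int d = 0; d < ndev; ++d) {
        CUDA_TRY(cudaSetDevice(d));
        float *d_arr = nullptr;  // this device's view of ARR
        CUDA_TRY(cudaHostGetDevicePointer((void **)&d_arr, ARR, 0));
        CUDA_TRY(cudaMalloc((void **)&d_out[d], bytes));
        scale_kernel<<<(n + 255) / 256, 256>>>(d_arr, d_out[d], n, float(d + 1));
        CUDA_TRY(cudaGetLastError());
    }

    // Wait for every device before touching results or unregistering.
    int ok = 0;
    for (int d = 0; d < ndev; ++d) {
        CUDA_TRY(cudaSetDevice(d));
        CUDA_TRY(cudaDeviceSynchronize());
        float last = 0.0f;
        CUDA_TRY(cudaMemcpy(&last, d_out[d] + (n - 1), sizeof(float),
                            cudaMemcpyDeviceToHost));
        if (last == float(d + 1)) ++ok;
        CUDA_TRY(cudaFree(d_out[d]));
    }

    CUDA_TRY(cudaHostUnregister(ARR));
    free(ARR);
    return ok;
}

int main()
{
    int ok = run_on_all_devices(1 << 20);
    printf("devices with matching results: %d\n", ok);
    return ok < 0 ? 1 : 0;
}
```

In this sketch the array is registered and unregistered only once, and every device is synchronized before the unregister call, since kernel launches are asynchronous and unregistering too early would pull the memory out from under running kernels.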