Will this be fast?

I have an array, d_zero, in global memory. It has length n and contains only 0s and 1s.
I also have two other arrays, A and B, in global memory, each the same length as d_zero.
I want to do the following:
If d_zero[i] == 0, then A[i] += B[i]
Should I use shared memory? I don't think there is much data to share between threads. If not, how can I make this fast? Or is it already fast using global memory alone?
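A direct translation of the question into a kernel might look like the sketch below. It assumes float elements for A and B and int elements for d_zero (the question does not give the types), one thread per element, and a hypothetical launcher name, launchAddWhereZero. For a body this short the compiler may predicate the branch rather than actually diverge, but the branch-free form in the answer avoids the question entirely.

```cuda
#include <cuda_runtime.h>

// Assumptions: A and B hold float, d_zero holds int values that are only
// 0 or 1, and all three are device-accessible pointers of length n.
__global__ void addWhereZero(float *A, const float *B, const int *d_zero, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n && d_zero[i] == 0) {
        A[i] += B[i];
    }
}

// Hypothetical launcher: 256 threads per block, enough blocks to cover n.
void launchAddWhereZero(float *A, const float *B, const int *d_zero, int n)
{
    if (n <= 0) return;  // a launch with zero blocks is an invalid configuration
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    addWhereZero<<<blocks, threads>>>(A, B, d_zero, n);
}
```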

  1. This computation does not need data from neighboring elements: each thread reads and writes only index i. So there is nothing to share, and you do not need shared memory.

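  2. The operation does very little arithmetic per element, so the kernel will very likely be memory-bound: its speed is limited by global-memory bandwidth, not by computation.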

  3. You can eliminate the “if” statement by rewriting the update as a multiplication and an addition. For example:

A[i] += B[i] * (1 - d_zero[i]);

As long as reads and writes to device memory are coalesced (consecutive threads in a warp access consecutive elements), this is very fast.
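Putting the branch-free update and coalesced access together, a minimal sketch might look like this. It assumes float A and B and int d_zero; the names maskedAddKernel and maskedAdd, the 256-thread block size, and the 1024-block cap are illustrative choices, not from the answer. Thread i handles element i (and then i + stride in a grid-stride loop), so each warp reads and writes a contiguous run of addresses.

```cuda
#include <cuda_runtime.h>

// Branch-free update, assuming float A/B and int d_zero holding only 0 or 1.
// Consecutive threads touch consecutive elements, so loads and stores coalesce.
__global__ void maskedAddKernel(float *A, const float *B, const int *d_zero, int n)
{
    const int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        // (1 - d_zero[i]) is 1 where d_zero is 0 and 0 where it is 1.
        A[i] += B[i] * (1 - d_zero[i]);
    }
}

// Hypothetical helper: copies host arrays to the device, runs the kernel,
// copies A back, and returns the first CUDA error encountered (if any).
cudaError_t maskedAdd(float *hA, const float *hB, const int *hZero, int n)
{
    if (n <= 0) return cudaSuccess;
    const size_t fbytes = n * sizeof(float);
    const size_t ibytes = n * sizeof(int);
    float *dA = nullptr, *dB = nullptr;
    int *dZero = nullptr;

    cudaError_t err = cudaMalloc(&dA, fbytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, fbytes);
    if (err == cudaSuccess) err = cudaMalloc(&dZero, ibytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, fbytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, fbytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dZero, hZero, ibytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 256;
        int blocks = (n + threads - 1) / threads;
        if (blocks > 1024) blocks = 1024;  // the grid-stride loop covers the rest
        maskedAddKernel<<<blocks, threads>>>(dA, dB, dZero, n);
        err = cudaGetLastError();  // catches launch-configuration errors
    }
    // cudaMemcpy on the default stream waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(hA, dA, fbytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);  // cudaFree on a null pointer does nothing
    cudaFree(dB);
    cudaFree(dZero);
    return err;
}
```

One behavioral difference from the branching version: where d_zero[i] is 1 but B[i] is NaN or infinite, the product B[i] * 0 is NaN, so A[i] becomes NaN instead of staying unchanged. If B can hold such values, keep the branch.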