I have a large array of roughly 20 million elements.
I want to multiply each element by 2 using a block size of 512, so I wrote the kernel body like this:
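    int bx = blockIdx.x;
    int tx = threadIdx.x;
    // Parentheses are required: + binds tighter than <<, so bx<<9+tx would mean bx<<(9+tx).
    binary[(bx << 9) + tx] = binary[(bx << 9) + tx] << 1;
    __syncthreads();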
Will this give me good performance? Or should I use shared memory or some other trick?
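For context, here is a minimal, self-contained sketch of how a fragment like this might sit inside a complete program. The kernel name `doubleElements`, the element-count parameter `n`, the bounds check, and all of the host-side setup are assumptions added for illustration; they are not part of the original fragment. The index is parenthesized because `+` binds tighter than `<<` in C/C++, and the `__syncthreads()` call is left out because no thread reads another thread's data here.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element with a left shift.
// Assumes it is launched with exactly 512 threads per block.
__global__ void doubleElements(int *binary, int n)
{
    int idx = (blockIdx.x << 9) + threadIdx.x;  // same as blockIdx.x * 512 + threadIdx.x
    if (idx < n)                                // the last block may be only partly used
        binary[idx] <<= 1;
}

int main()
{
    const int n = 20000000;                          // roughly 20 million elements (assumed)
    const int threads = 512;                         // block size from the question
    const int blocks = (n + threads - 1) / threads;  // round up to cover every element
    const size_t bytes = (size_t)n * sizeof(int);

    // Fill a host array with small known values so the result can be checked.
    int *host = (int *)malloc(bytes);
    if (!host) return 1;
    for (int i = 0; i < n; ++i) host[i] = i % 1000;

    int *dev = NULL;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) { free(host); return 1; }
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    doubleElements<<<blocks, threads>>>(dev, n);

    // Catch launch errors first, then errors raised while the kernel ran.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        cudaFree(dev);
        free(host);
        return 1;
    }

    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    // Verify on the host that every element was doubled.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (host[i] != 2 * (i % 1000)) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(dev);
    free(host);
    return mismatches == 0 ? 0 : 1;
}
```

This should build with something like `nvcc double.cu -o double` (the file name is hypothetical). In this sketch consecutive threads touch consecutive elements, and each element is read and written exactly once, so there is no data reuse for shared memory to exploit.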