I have a doubt that has always been at the back of my mind…
This may seem like a really noobish question to some of you :"> , but I am not a CS person, so I don't understand how all of this actually works at its core, and I want to learn…
Suppose we have situation A:
__shared__ double a[128]; // shared array, one element per thread in the block
b is some global double array variable;
// first we read from b into a
a[tid] = b[indx]; // assume the access is coalesced
// compute (no sync between threads required)
a[tid] = a[tid] * 2.0;
// write the result back to b
b[indx] = a[tid];
and we have situation B:
b is some global double array variable;
// assume the access to device memory is coalesced
b[indx] = b[indx] * 2.0;
Say we launch 32768 threads with 128 threads per block…
Would situation A be faster than B? If so, why? In both cases we have one read from and one write to global memory, and in situation A we also have the small overhead of copying from b to a and back…
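To put a number on that premise, here is the global memory traffic worked out. This assumes 8-byte doubles, one element of b per thread, and ignores any caching:

```latex
\[
  32768 \ \text{threads} \times (8\,\text{B read} + 8\,\text{B written})
  = 524288\,\text{B} = 512\,\text{KiB}
\]
```

Under that assumption the figure is the same for situation A and situation B. The only difference is the extra round trip through shared memory in A.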
Does doing 1 flop require multiple reads from memory? (Intuitively it should be just one read.)
Or is it because multiplying a global variable by some constant/variable is slower than multiplying a shared variable?
I am not sure what the exact answer to the above question is…
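For anyone who wants to try this, here is a minimal sketch of the two situations as complete kernels, with a small timing harness. Several things are my assumptions and not part of the snippets above: the kernel names `scale_via_shared` and `scale_direct`, the helper `run_and_check`, the indexing `indx = blockIdx.x * blockDim.x + threadIdx.x`, and the use of CUDA events for timing. It is only a rough way to compare the two, not a careful benchmark.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreadsPerBlock = 128;    // from the question
constexpr int kNumThreads      = 32768;  // from the question

// Situation A: stage each value through shared memory.
__global__ void scale_via_shared(double *b, int n)
{
    __shared__ double a[kThreadsPerBlock];
    int tid  = threadIdx.x;
    int indx = blockIdx.x * blockDim.x + tid;  // assumed indexing (coalesced)
    if (indx < n) {
        a[tid]  = b[indx];       // global -> shared
        a[tid]  = a[tid] * 2.0;  // compute, no sync needed (each thread uses its own slot)
        b[indx] = a[tid];        // shared -> global
    }
}

// Situation B: operate on the global array directly.
__global__ void scale_direct(double *b, int n)
{
    int indx = blockIdx.x * blockDim.x + threadIdx.x;
    if (indx < n) {
        b[indx] = b[indx] * 2.0;
    }
}

// Hypothetical helper: runs `kernel` once as a warm-up, then times a second
// launch with CUDA events and verifies that every element was doubled.
bool run_and_check(void (*kernel)(double *, int), float *ms)
{
    const int n = kNumThreads;
    const size_t bytes = n * sizeof(double);
    std::vector<double> in(n), out(n);
    for (int i = 0; i < n; ++i) in[i] = static_cast<double>(i);

    double *d_b = nullptr;
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) return false;
    const int blocks = n / kThreadsPerBlock;  // 32768 / 128 = 256 blocks

    // Warm-up launch so the timed launch excludes one-time startup cost.
    cudaMemcpy(d_b, in.data(), bytes, cudaMemcpyHostToDevice);
    kernel<<<blocks, kThreadsPerBlock>>>(d_b, n);

    // Restore the input, then time a single launch.
    cudaMemcpy(d_b, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    kernel<<<blocks, kThreadsPerBlock>>>(d_b, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    bool ok = (cudaGetLastError() == cudaSuccess);
    cudaMemcpy(out.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_b);

    for (int i = 0; ok && i < n; ++i) ok = (out[i] == 2.0 * in[i]);
    return ok;
}

int main()
{
    float ms_a = 0.0f, ms_b = 0.0f;
    bool ok_a = run_and_check(scale_via_shared, &ms_a);
    bool ok_b = run_and_check(scale_direct, &ms_b);
    std::printf("A (via shared): %s, %f ms\n", ok_a ? "ok" : "FAILED", ms_a);
    std::printf("B (direct):     %s, %f ms\n", ok_b ? "ok" : "FAILED", ms_b);
    return (ok_a && ok_b) ? 0 : 1;
}
```

This should build with something like `nvcc -o compare compare.cu` (the file name is arbitrary). A single launch on such a small array will give noisy timings, so averaging over many launches would give a fairer comparison.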
Thanks
NA