<load data to SM (shared memory)>
__syncthreads();
<process data>
__syncthreads();
<set SM to 0>
__syncthreads();
<write results into SM>
__syncthreads();
<write results from SM to global memory>
Now, that is quite a lot of __syncthreads() calls, and I wonder whether there is a faster way to set the SM back to 0, something like memset from the C standard library or cudaMemset on the host?
Well, for performance reasons, I use the SM to store the delta from the old solution to the new solution, and afterwards the SM is added to the global solution. Anyway, if there is no faster way to zero the memory, I don't need to change anything. Thanks for the feedback.
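For reference, here is a minimal sketch of the zeroing step as I understand it. As far as I know, cudaMemset only works on device (global) memory from the host side, so inside a kernel the threads of a block usually clear shared memory themselves with a strided loop. The kernel name apply_delta, the sizes BLOCK and TILE, the placeholder update and the host helper run_demo are all made up for illustration and are not taken from my real code.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 128;        // threads per block (assumed value)
constexpr int TILE  = 4 * BLOCK;  // shared-memory elements per block (assumed value)

// Hypothetical kernel: zero the tile cooperatively, store a delta in it,
// then add the delta to the global solution.
__global__ void apply_delta(float* solution, int n)
{
    __shared__ float sm[TILE];

    // Cooperative zeroing: thread t clears elements t, t + BLOCK, t + 2*BLOCK, ...
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        sm[i] = 0.0f;
    // This barrier is needed here because each thread writes into a slot
    // that a different thread zeroed. If every thread only touched the
    // slots it cleared itself, this barrier could be dropped.
    __syncthreads();

    // Placeholder for the real computation: store a delta of 1.0 per element.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x) {
        int j = (i + 1) % TILE;   // slot that was zeroed by another thread
        sm[j] += 1.0f;
    }
    __syncthreads();              // all deltas must be visible before the write-back

    // Add the delta to the global solution.
    int base = blockIdx.x * TILE;
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        if (base + i < n)
            solution[base + i] += sm[i];
}

// Hypothetical host-side check: start from 2.0 everywhere, expect 3.0 afterwards.
bool run_demo()
{
    const int n = 2 * TILE;
    std::vector<float> host(n, 2.0f);
    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    apply_delta<<<n / TILE, BLOCK>>>(dev, n);
    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return false;
    for (float v : host)
        if (v != 3.0f) return false;
    return true;
}
```

The strided loop is probably about as cheap as zeroing gets. The saving, if any, would come from removing barriers: when each thread zeroes exactly the elements it writes afterwards, the __syncthreads() between "set SM to 0" and "write results into SM" should not be necessary.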