I am running a kernel that would hypothetically look like this:
    __global__ void complexKernel(unsigned int arraySize, float* c, float* c_1, float* c_2) {
        int i = threadIdx.x;
        for (int sample = 0; sample < 1024; sample++) {
            // push back
            c_2[i] = c_1[i];
            c_1[i] = c[i];
            __syncthreads();
            // increment
            c[i] = c[i] + 1.0f;
            __syncthreads();
            // smooth: compute into a register first, so that no thread
            // overwrites c[i] while a neighbour is still reading it
            float smoothed = c[i];
            if (i > 0 && i < arraySize - 1) {
                smoothed = (c[i] + c[i] + c[i + 1] + c[i - 1]) * 0.25f;
            }
            __syncthreads();
            c[i] = smoothed;
            __syncthreads();
        }
    }
Because I need synchronization at several points in each loop iteration (this is just an example of the principle), my understanding is that I must limit the kernel to 32 threads in a single block.
Is it sufficient to launch it like this:
    complexKernel<<<1, 32, 0, stream>>>(parameters...)
Will this guarantee that every __syncthreads() call synchronizes all of the threads, since everything runs in a single block where full synchronization is possible?
Or is there anything else I must do? Thanks.
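
For reference, here is a minimal sketch of the host side I have in mind. The helper name runSingleBlock, the single packed allocation, and the zero-initialised inputs are placeholders for illustration, not my real code. The kernel repeats the steps above, with the smoothed value held in a register until after a barrier so that no thread reads a neighbour while it is being overwritten. It assumes arraySize equals the number of threads in the block.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

__global__ void complexKernel(unsigned int arraySize, float* c, float* c_1, float* c_2) {
    int i = threadIdx.x;
    for (int sample = 0; sample < 1024; sample++) {
        // push back
        c_2[i] = c_1[i];
        c_1[i] = c[i];
        __syncthreads();
        // increment
        c[i] = c[i] + 1.0f;
        __syncthreads();
        // smooth: stage the result in a register, write it after a barrier
        float smoothed = c[i];
        if (i > 0 && i < arraySize - 1) {
            smoothed = (c[i] + c[i] + c[i + 1] + c[i - 1]) * 0.25f;
        }
        __syncthreads();
        c[i] = smoothed;
        __syncthreads();
    }
}

// Hypothetical helper: runs the kernel as one block of n threads on its own
// stream, starting from all-zero arrays, and copies the final c into `out`.
cudaError_t runSingleBlock(unsigned int n, std::vector<float>& out) {
    assert(n >= 2 && n <= 1024);   // one block holds at most 1024 threads
    const size_t bytes = n * sizeof(float);
    float* buf = nullptr;          // assumption: c, c_1 and c_2 packed in one allocation
    cudaStream_t stream = nullptr;
    out.assign(n, 0.0f);

    cudaError_t err = cudaStreamCreate(&stream);
    if (err == cudaSuccess) err = cudaMalloc(&buf, 3 * bytes);
    if (err == cudaSuccess) err = cudaMemsetAsync(buf, 0, 3 * bytes, stream);
    if (err == cudaSuccess) {
        complexKernel<<<1, n, 0, stream>>>(n, buf, buf + n, buf + 2 * n);
        err = cudaGetLastError();  // reports launch-configuration errors
    }
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(out.data(), buf, bytes, cudaMemcpyDeviceToHost, stream);
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);

    cudaFree(buf);                 // freeing a null pointer is a no-op
    if (stream) cudaStreamDestroy(stream);
    return err;
}
```

Calling runSingleBlock(32, out) with these zero-initialised inputs should leave every element of out equal to 1024, because a uniform array stays uniform through the smoothing step and is incremented once per sample.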