My CUDA kernel looks roughly like the code below. I am using two shared-memory arrays and doing a form of loop tiling inside the kernel: the first shared array needs to be refreshed on each iteration of the outer loop (`block1`), while the second needs to be refreshed on each iteration of the inner loop (`block2`).

However, it looks like I am not getting the first shared array's data right, while the second array's data looks correct. I am using a single `__syncthreads()` after updating the second array, in the hope that every thread will wait at that point. Do I also need a `__syncthreads()` after the first array's update (the line where I commented it out)? Adding it there slows the whole kernel down significantly.

What should I do?

```
// Two `extern __shared__` declarations alias the same address, so the
// two tiles are carved out of a single dynamic shared buffer instead.
extern __shared__ unsigned long long int shared_mem[];
unsigned long long int *shared_matrix  = shared_mem;
unsigned long long int *shared_matrix2 = shared_mem + 32 * group_size;

for (int block1 = block_index; block1 < num_blocks; block1++) {
    int i3_start = block1 * 32;
    // load the outer tile: refreshed once per block1 iteration
    for (int i = 0; i < group_size; i++) {
        shared_matrix[localIdx * group_size + i] =
            global_matrix[(block1 * 32 + localIdx) * group_size + i];
    }
    //__syncthreads(); // Do I need this, given the __syncthreads() below?
    for (int block2 = block1; block2 < num_blocks; block2++) {
        int i4_start = block2 * 32;
        // load the inner tile: refreshed once per block2 iteration
        // (indexed with block2, matching i4_start above)
        for (int i = 0; i < group_size; i++) {
            shared_matrix2[localIdx * group_size + i] =
                global_matrix[(block2 * 32 + localIdx) * group_size + i];
        }
        __syncthreads();
        for (int i3 = i3_start; i3 < i3_max; i3++) {
            for (int i4 = i4_start; i4 < i4_max; i4++) {
                f(shared_matrix[g(i3)], shared_matrix2[g(i4)]);
            }
        }
    }
}
```