Using __syncthreads() while using two shared arrays

My CUDA kernel looks roughly like the code below. It uses two shared arrays and performs a form of loop tiling. The first shared array needs to be updated on each iteration of the outer loop (block1), while the second shared array needs to be updated on each iteration of the inner loop (block2).

However, the data in the first shared array appears to be wrong, while the data in the second array looks correct. I call __syncthreads() only once, after updating the second array, expecting every thread to wait at that point. Do I also need a __syncthreads() after updating the first array (the line where I commented it out)? Adding it there slows the whole kernel down significantly.

What should I do?

extern __shared__ unsigned long long int shared_matrix[];
extern __shared__ unsigned long long int shared_matrix2[];

for(int block1 = block_index; block1 < num_blocks; block1++){
   int i3_start = block1 * 32;
   for(int i = 0; i < group_size; i++) {
       shared_matrix[localIdx * group_size + i] = global_matrix[(block1 * 32 + localIdx) * group_size + i];
   }
   //__syncthreads(); // Do I need this line, given the __syncthreads() further down?
   for(int block2 = block1; block2 < num_blocks; block2++) {
       int i4_start = block2 * 32;
       
       for(int i = 0; i < group_size; i++) {
           shared_matrix2[localIdx * group_size + i] = global_matrix[(block2 * 32 + localIdx) * group_size + i];
       }
       __syncthreads();
       
       for(int i3 = i3_start; i3 < i3_max; i3++){
           for(int i4 = i4_start; i4 < i4_max; i4++) {
               f(shared_matrix[g(i3)], shared_matrix2[g(i4)]);
           }
       }
   }
}
  • It's frequently the case that a loop around a shared-memory load/use cycle needs at least two __syncthreads() statements: one after the loads, so that no thread reads a value before it has been written, and one after the reads, so that no thread overwrites the data on the next iteration while another thread is still reading it. Since you have multiple shared arrays and multiple sets of loops, it's quite possible you need more than two.
  • The racecheck tool in compute-sanitizer (compute-sanitizer --tool racecheck) is very useful for this type of work. You might want to learn how to use it.
  • I wouldn’t be able to make definitive statements based on incomplete snippets.
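
To make the two-barrier pattern from the first point concrete, here is a minimal sketch. It is not the poster's kernel: the names tile_pairs, smem, TILE and count_mismatches are hypothetical, the multiply-accumulate is only a stand-in for f and g, and it assumes one 32-thread block with one element per thread. It also carves both tiles out of a single extern __shared__ buffer by offset. In CUDA, every array declared extern __shared__ in a kernel starts at the same address, so two such declarations overlap unless the layout is managed this way. Whether that is what corrupts the first array in the question cannot be confirmed from the snippet, but it is worth checking.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // threads per block and elements per tile (assumed)

// One dynamic shared buffer; both tiles are carved out of it by offset.
extern __shared__ unsigned long long smem[];

__global__ void tile_pairs(const unsigned long long* in,
                           unsigned long long* out, int num_tiles)
{
    unsigned long long* tileA = smem;         // first TILE elements
    unsigned long long* tileB = smem + TILE;  // next TILE elements
    const int tid = threadIdx.x;
    unsigned long long acc = 0;

    for (int t1 = 0; t1 < num_tiles; ++t1) {
        tileA[tid] = in[t1 * TILE + tid];      // outer-loop tile
        for (int t2 = t1; t2 < num_tiles; ++t2) {
            tileB[tid] = in[t2 * TILE + tid];  // inner-loop tile
            __syncthreads();  // writes to tileA and tileB are now visible

            // Each thread reads elements that other threads wrote.
            acc += tileA[(tid + 1) % TILE] * tileB[(tid + 2) % TILE];

            __syncthreads();  // all reads finish before either tile is overwritten
        }
    }
    out[tid] = acc;
}

// Hypothetical host helper: runs one block and returns how many results
// differ from a CPU reference.
int count_mismatches(int num_tiles = 4)
{
    const int n = num_tiles * TILE;
    std::vector<unsigned long long> h_in(n), h_out(TILE, 0), ref(TILE, 0);
    for (int i = 0; i < n; ++i) h_in[i] = i + 1;

    // CPU reference for the same tile-pair traversal.
    for (int t1 = 0; t1 < num_tiles; ++t1)
        for (int t2 = t1; t2 < num_tiles; ++t2)
            for (int tid = 0; tid < TILE; ++tid)
                ref[tid] += h_in[t1 * TILE + (tid + 1) % TILE] *
                            h_in[t2 * TILE + (tid + 2) % TILE];

    unsigned long long *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(unsigned long long));
    cudaMalloc(&d_out, TILE * sizeof(unsigned long long));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(unsigned long long),
               cudaMemcpyHostToDevice);

    // The dynamic shared size must cover both tiles.
    const size_t smem_bytes = 2 * TILE * sizeof(unsigned long long);
    tile_pairs<<<1, TILE, smem_bytes>>>(d_in, d_out, num_tiles);
    cudaMemcpy(h_out.data(), d_out, TILE * sizeof(unsigned long long),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int bad = 0;
    for (int i = 0; i < TILE; ++i) bad += (h_out[i] != ref[i]);
    return bad;
}
```

In this sketch the barrier after the inner-loop load also covers the outer-loop write to tileA, because the inner loop always runs at least once before tileA is read, so no separate barrier is placed after the outer load. The trailing barrier is what protects both tiles from being overwritten early. To try it, call count_mismatches() from a main function, build with nvcc, and run the binary under compute-sanitizer --tool racecheck; removing either barrier should give racecheck a hazard to report.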