__syncthreads() problem?

Hi, I'm trying to run code that computes the sequence of factorials modulo a prime number (1!, 2!, 3!, …, n!). I don't use 0!.
This code worked up to n = 512, the size of a thread block. Beyond that, the results sometimes matched my reference C++ code and sometimes didn't. Yet there is no randomness involved, so the results should be identical on every run, and they were not. I realized it was because I had forgotten to call __syncthreads(), which is needed to make sure every thread has finished a step of the loop before the next step starts, since each step reads an element modified in the previous one. After adding __syncthreads() at the end of my loop, the CUDA code worked correctly up to n = 2^13. Now I have the same problem as before, but only from n = 2^14 onward. I don't really understand why the problem reappears even though I added __syncthreads(). My code has been checked again and again; I think it's a problem of writes landing in the wrong order, but I don't see why. If you have any idea, can you help me please?

PS: just to give an idea of my code, here it is, with n the size of the array Fact, e = log2(n), and p a prime number. Attached is a picture of the computations performed.

__global__ void create_factorial_GPU(sfixn *Fact, sfixn n, sfixn e, sfixn p)
{
int k = blockIdx.x * blockDim.x + threadIdx.x;
int i, j, part, pos, base;
int L = 1;
int B = 2;

// main computation
if (k < n/2)
{
// step 1
Fact[2*k+1] = mul_mod(Fact[2*k], Fact[2*k+1], p);
__syncthreads();

// next steps
for (i=1; i<e; i++)
{
  B *= 2;
  L *= 2;
  part = k / L;
  pos = k % L;
  j = L + part*B + pos;
  base = Fact[L + part*B - 1];
  Fact[j] = mul_mod(base, Fact[j], p); // computes base * Fact[j] mod p
  __syncthreads();
}

}
}
fact.png

Keep in mind that __syncthreads() does not synchronize threads in different blocks. I worry that you have a race condition in your step 1 when you access adjacent elements in the Fact array that another thread could be writing to. The fact that the code works erratically when you make various timing changes by adding __syncthreads() would also be consistent with a race condition.

I don't really see what you mean by a race condition. In the first step, no thread writes to the same location as any other thread.
If I understand correctly, threads in different blocks are not synchronized by __syncthreads(), so I also tried __threadfence(), but without success. Is there an instruction like __syncblocks()?

No. Blocks don’t even necessarily execute in parallel, so there is no way to sync them.

Okay, I’ll try another way to do this computation. Thank you !