Hi, I am trying to run code that computes the sequence of factorials modulo a prime number (1!, 2!, 3!, …, n!). I don't use 0!.

This code worked up to n = 512, the size of a thread block. Beyond that, the results sometimes matched my C++ version and sometimes did not. There are no random numbers involved, so the results should be the same on every run, but they were not. I noticed that this was because I had forgotten to call `__syncthreads()`, which I need so that every thread has finished one step of the loop before the next step starts, since each step reads an element modified in the previous step. So I added `__syncthreads()` at the end of my loop, and then my CUDA code worked correctly up to n = 2^13. Now I have the same problem as before, but only from n = 2^14 onwards. I don't really understand why the problem is back even though I added `__syncthreads()`. I have checked my code again and again and I believe the computation itself is correct; I think it is a problem of values being stored in the wrong order, and I don't see why. If you have an idea, could you please help me?
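For context, the C++ comparison can be as simple as a sequential loop. The following is only a minimal sketch, not my original C++ code: it assumes `sfixn` is a 32-bit signed integer, and the `mul_mod` shown here is a stand-in written for illustration (a plain 64-bit multiply followed by `%`), not the real `mul_mod` used by the kernel. The name `factorial_reference` is also made up for this sketch.

```cpp
#include <cstdint>
#include <vector>

typedef int sfixn;  // assumption: sfixn is a 32-bit signed integer

// Illustrative stand-in for mul_mod (assumption): (a * b) mod p,
// computed through a 64-bit intermediate to avoid overflow.
static sfixn mul_mod(sfixn a, sfixn b, sfixn p) {
    return static_cast<sfixn>((static_cast<std::int64_t>(a) * b) % p);
}

// Sequential reference: returns Fact with Fact[i] = (i+1)! mod p
// for i = 0..n-1, so 0! is not stored.
std::vector<sfixn> factorial_reference(sfixn n, sfixn p) {
    std::vector<sfixn> Fact(n);
    sfixn acc = 1;
    for (sfixn i = 0; i < n; ++i) {
        acc = mul_mod(acc, (i + 1) % p, p);  // acc becomes (i+1)! mod p
        Fact[i] = acc;
    }
    return Fact;
}
```

Since this loop has no parallelism at all, its output is the same on every run, which is what makes it usable as the baseline for the GPU results.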

PS: to give an idea of my code, here is the kernel, where n is the size of the array Fact, e = log2(n), and p is a prime number. A picture of the computations performed is attached.

```
__global__ void create_factorial_GPU(sfixn *Fact, sfixn n, sfixn e, sfixn p)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int i, j, part, pos, base;
    int L = 1;
    int B = 2;

    // suite
    if (k < n/2)
    {
        // step 1
        Fact[2*k+1] = mul_mod(Fact[2*k], Fact[2*k+1], p);
        __syncthreads();

        // next steps
        for (i=1; i<e; i++)
        {
            B *= 2;
            L *= 2;
            part = k / L;
            pos = k % L;
            j = L + part*B + pos;
            base = Fact[L + part*B - 1];
            Fact[j] = mul_mod(base, Fact[j], p); // computes base * Fact[j] mod p
            __syncthreads();
        }
    }
}
```
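To separate the index arithmetic from the ordering question, the same scheme can be replayed on the CPU with each step completed for every k before the next step begins. This is a rough sketch under assumptions: `sfixn` is taken to be a 32-bit signed integer, `mul_mod` is an illustrative stand-in rather than the real one, n is a power of two with e = log2(n), the array is assumed to hold Fact[i] = i+1 on entry, and the name `create_factorial_stepwise` is made up here.

```cpp
#include <cstdint>
#include <vector>

typedef int sfixn;  // assumption: sfixn is a 32-bit signed integer

// Illustrative stand-in for mul_mod (assumption): (a * b) mod p,
// computed through a 64-bit intermediate to avoid overflow.
static sfixn mul_mod(sfixn a, sfixn b, sfixn p) {
    return static_cast<sfixn>((static_cast<std::int64_t>(a) * b) % p);
}

// CPU replay of the kernel's index scheme. The loop over k plays the
// role of the threads; a step is finished for all k before the next
// step starts. Assumes n = 2^e and Fact[i] = i+1 on entry.
void create_factorial_stepwise(std::vector<sfixn> &Fact, sfixn n, sfixn e, sfixn p) {
    // step 1: combine neighbouring pairs
    for (sfixn k = 0; k < n / 2; ++k)
        Fact[2*k+1] = mul_mod(Fact[2*k], Fact[2*k+1], p);

    sfixn L = 1;
    sfixn B = 2;
    // next steps: same formulas as in the kernel
    for (sfixn i = 1; i < e; ++i) {
        B *= 2;
        L *= 2;
        for (sfixn k = 0; k < n / 2; ++k) {
            sfixn part = k / L;
            sfixn pos = k % L;
            sfixn j = L + part*B + pos;
            sfixn base = Fact[L + part*B - 1];  // last entry of the left half-block
            Fact[j] = mul_mod(base, Fact[j], p);
        }
    }
}
```

If this replay gives (i+1)! mod p in Fact[i] for the sizes where the GPU result is wrong, then the formulas for `j` and `base` are probably not the issue, and what remains to check is whether every thread of the whole grid has really finished step i before any thread starts step i+1.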