Hi,

I have an extremely simple kernel, and when I run it multiple times on the same input, it sometimes gives wrong results.

The kernel does one step of a matrix transposition: it computes one row of the transposed matrix. I call it K times, so it computes the whole transposed matrix. If I repeat this whole transposition enough times, an error occurs somewhere in the computation. Most of the time it shows up within the first 10 iterations and it rarely takes more than 20, but it seems to be completely nondeterministic.

I’ve seen lots of weird behaviour when I indexed arrays out of bounds or when there were race conditions, but I don’t see any of that in this code. Maybe I’m just dense, but I’m stumped with this problem.

Also, please note that matrix transposition isn’t my aim; this is just what the problem has been narrowed down to.

Here’s the whole program that produces the errors for me reliably:

[codebox]#include <stdio.h>
#include <stdlib.h>

// Copies column i of the N x K row-major matrix logassignments_d into tmp,
// i.e. row i of the transposed matrix, using a grid-stride loop.
__global__ void MStep_kernel1(unsigned int N, unsigned int K, unsigned int i, double *logassignments_d, double *tmp)
{
    for (unsigned int j=blockIdx.x*blockDim.x+threadIdx.x; j<N; j+=gridDim.x*blockDim.x)
        tmp[j]=logassignments_d[j*K+i];
}

int main()
{
    unsigned int N,K,iter;
    double *la_h,*la_d;
    double *tmp1_h,*tmp2_h,*tmp_d;

    K=44;
    N=100*K;

    la_h=new double[N*K];
    tmp1_h=new double[N*K];
    tmp2_h=new double[N*K];

    cudaMalloc((void**)&la_d,N*K*sizeof(double));
    cudaMalloc((void**)&tmp_d,N*sizeof(double));

    // Fill the input so that every element equals its own linear index.
    for (unsigned int i=0; i<N*K; i++)
        la_h[i]=(double)i;

    cudaMemcpy(la_d,la_h,N*K*sizeof(double),cudaMemcpyHostToDevice);

    double *swap;
    iter=100;

    for (unsigned int a=0; a<iter; a++)
    {
        // One full transposition: K kernel calls, one transposed row each.
        for (unsigned int i=0; i<K; i++)
        {
            MStep_kernel1<<<32,256>>>(N,K,i,la_d,tmp_d);
            cudaMemcpy(tmp1_h+i*N,tmp_d,N*sizeof(double),cudaMemcpyDeviceToHost);
        }
        // Compare this iteration's result with the previous iteration's.
        if (a>0)
        {
            for (unsigned int j=0; j<N*K; j++)
            {
                if (tmp1_h[j]!=tmp2_h[j])
                {
                    printf("error!\n iteration: %u\n element: %u -> (%u,%u)\n values: %lf %lf -> (%d,%d) (%d,%d)\n",a,j,j%N,j/N,tmp2_h[j],tmp1_h[j],((int)tmp2_h[j])/K,((int)tmp2_h[j])%K,((int)tmp1_h[j])/K,((int)tmp1_h[j])%K);
                    exit(0);
                }
            }
        }
        swap=tmp1_h;
        tmp1_h=tmp2_h;
        tmp2_h=swap;
    }

    // Arrays allocated with new[] must be released with delete[].
    delete[] la_h;
    delete[] tmp1_h;
    delete[] tmp2_h;

    cudaFree(la_d);
    cudaFree(tmp_d);

    return 0;
}
[/codebox]
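
The program above only compares each iteration with the previous one, so a mismatch does not say which of the two results is the wrong one. A host-side reference could settle that. Below is a minimal sketch in plain C++, assuming the same row-major N x K layout as the program above; `expected_transpose` and `first_mismatch` are hypothetical helper names that are not part of the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: CPU reference for K calls of the kernel.
// Step i copies column i of the N x K row-major input into
// out[i*N] .. out[i*N + N - 1], matching tmp1_h + i*N in the program.
std::vector<double> expected_transpose(const std::vector<double>& in,
                                       std::size_t N, std::size_t K)
{
    assert(in.size() == N * K);
    std::vector<double> out(N * K);
    for (std::size_t i = 0; i < K; ++i)
        for (std::size_t j = 0; j < N; ++j)
            out[i * N + j] = in[j * K + i];
    return out;
}

// Hypothetical helper: index of the first element where `got` differs
// from `want`, or want.size() if the two buffers are identical.
std::size_t first_mismatch(const double* got, const std::vector<double>& want)
{
    for (std::size_t j = 0; j < want.size(); ++j)
        if (got[j] != want[j])
            return j;
    return want.size();
}
```

Checking `tmp1_h` against this reference after every full transposition, instead of against the previous iteration, would show directly which run went wrong. Because the input is filled with `la_h[i]=i`, the expected value at `out[i*N+j]` is simply `j*K+i`.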

And I do compile with --gpu-architecture sm_13 :)

The specs:

CUDA 2.2, GTX285, Debian 5.0.2

Thanks for any help in advance!