Very simple kernel sometimes gives wrong results (see code in thread)

Hi,

I have an extremely simple kernel and if I run it multiple times on the same input it sometimes gives wrong results.

The kernel does one step of a matrix transposition: it computes one row of the transposed matrix. I call it K times to compute the whole transposed matrix. If I repeat this whole transposition process enough times, an error eventually occurs somewhere in the computation (usually within the first 10 iterations; it rarely gets above 20, but it seems to be completely nondeterministic).

I’ve seen lots of weird behaviour when I indexed arrays out of bounds or when there were race conditions, but I don’t see either of those in this code. Maybe I’m just dense, but I’m stumped by this problem.

Also, please note that matrix transposition isn’t my aim, this is just what the problem has been narrowed down to.

Here’s the whole program that produces the errors for me reliably:

[codebox]#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void MStep_kernel1(unsigned int N, unsigned int K, unsigned int i,
                              double *logassignments_d, double *tmp)
{
  // Grid-stride loop: copy column i of the row-major N x K matrix
  // into tmp, i.e. compute row i of the transpose.
  for (unsigned int j = blockIdx.x * blockDim.x + threadIdx.x; j < N;
       j += gridDim.x * blockDim.x)
    tmp[j] = logassignments_d[j * K + i];
}

int main()
{
  unsigned int N, K, iter;
  double *la_h, *la_d;
  double *tmp1_h, *tmp2_h, *tmp_d;

  K = 44;
  N = 100 * K;
  la_h = new double[N * K];
  tmp1_h = new double[N * K];
  tmp2_h = new double[N * K];
  cudaMalloc((void **)&la_d, N * K * sizeof(double));
  cudaMalloc((void **)&tmp_d, N * sizeof(double));

  for (unsigned int i = 0; i < N * K; i++)
    la_h[i] = (double)i;
  cudaMemcpy(la_d, la_h, N * K * sizeof(double), cudaMemcpyHostToDevice);

  double *swap;
  iter = 100;
  for (unsigned int a = 0; a < iter; a++)
  {
    // Compute the full transpose, one row per kernel launch.
    for (unsigned int i = 0; i < K; i++)
    {
      MStep_kernel1<<<32, 256>>>(N, K, i, la_d, tmp_d);
      cudaMemcpy(tmp1_h + i * N, tmp_d, N * sizeof(double),
                 cudaMemcpyDeviceToHost);
    }

    // The input never changes, so any mismatch with the previous
    // iteration's result is an error.
    if (a > 0)
    {
      for (unsigned int j = 0; j < N * K; j++)
      {
        if (tmp1_h[j] != tmp2_h[j])
        {
          printf("error!\n  iteration: %u\n  element: %u -> (%u,%u)\n"
                 "  values: %lf %lf -> (%d,%d) (%d,%d)\n",
                 a, j, j % N, j / N, tmp2_h[j], tmp1_h[j],
                 ((int)tmp2_h[j]) / K, ((int)tmp2_h[j]) % K,
                 ((int)tmp1_h[j]) / K, ((int)tmp1_h[j]) % K);
          exit(0);
        }
      }
    }

    // Swap the two host buffers so the current result becomes
    // the baseline for the next iteration.
    swap = tmp1_h;
    tmp1_h = tmp2_h;
    tmp2_h = swap;
  }

  delete[] la_h;
  delete[] tmp1_h;
  delete[] tmp2_h;
  cudaFree(la_d);
  cudaFree(tmp_d);
  return 0;
}
[/codebox]

And I do compile with --gpu-architecture sm_13 :)

The specs:

CUDA 2.2, GTX285, Debian 5.0.2

Thanks for any help in advance!

I had a strange problem that sounds a little like yours. See: http://forums.nvidia.com/index.php?showtopic=104270

The only suggestion that seemed to work was using CUDA 2.3 instead of 2.2.

I’ve compiled your code with “nvcc --gpu-architecture sm_13 test.cu” and run it over a thousand times on a GTX295 with CUDA 2.2 / Ubuntu without any problems.

Thanks. I compiled with the same command, no extra arguments, and the test still failed after a few iterations.

However, after installing the latest nvidia driver and cuda 2.3, the problem disappeared. Thanks for everyone’s help! :)