Lock causes infinite loop (bug report)

#include <stdio.h>

#define N 50000

/* Two threads in two different warps: thread 32 scans the array for the
   nonzero sentinel and publishes its index in a[0]; thread 0 spins on
   a[0] (the lock) and then copies the result to a[1]. */
__global__ static void lockman(int *a)
{
    int thid = threadIdx.x;

    if (thid & 31) return;          /* keep only lane 0 of each warp */

    if (thid) {                     /* thread 32 */
        int i;
        for (i = 0; a[i] == 0; i++) /* scan until the sentinel */
            ;
        a[0] = i;                   /* release thread 0 */
    } else {                        /* thread 0 */
        while (a[0] == 0)           /* spin until a[0] becomes nonzero */
            ;
        a[1] = a[0];
    }
}

int main()
{
    int *da;
    int ret[2] = {-1, -1};

    cudaMalloc((void **)&da, sizeof(int) * N);
    cudaMemset(da, 0, sizeof(int) * N);
    /* cudaMemset works per byte, so a[N-1] becomes 0x01010101;
       the kernel only tests for nonzero, so that is fine here */
    cudaMemset(da + N - 1, 1, sizeof(int));

    lockman<<<1, 64, 0>>>(da);

    cudaMemcpy(ret, da, sizeof(int) * 2, cudaMemcpyDeviceToHost);
    cudaFree(da);

    printf("%d == %d\n", ret[0], N - 1);
    printf("%d == %d\n", ret[1], N - 1);
    return 0;
}

This code should terminate, but the compiler produces PTX containing an infinite loop.

Those are the two lines I would look into. A change to the data seems required, but you have a semicolon right after each loop, so the loop body does nothing.

I think it could also be the 1,64,0. Doesn’t that refer to the x, y, z size? If x is 1, then the only thid that gets processed is 0, which puts you into an infinite while loop.

Those semicolons are obviously intentional. The one in the for loop is used to delay the release of thread 0; the one in the while loop is the lock.

Anyone who reads the docs carefully enough would know that 1,64,0 means 1 block, 64 threads, and 0 bytes of shared memory.

64 threads combined with if(thid&31)return; leave two active threads in two different warps, which makes the lock non-trivial.

I would check the while loop in the PTX to make sure that it’s still there, and if so, that it really reads a[0] on every iteration. It’s possible (and likely) that the compiler generates code that reads the value of a[0] into a register once and then reuses it.

Before people rush to judgment that such an optimization is incorrect, I’d like to point out that it is perfectly consistent with how global memory reads and writes are defined. One SHOULD NOT rely on global memory for inter-block communication.
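Concretely, without volatile the compiler is free to transform the spin into something like the following (an illustrative source-level rewrite of the fragments above, not actual generated code):

```cuda
/* what the programmer wrote */
while (a[0] == 0)
    ;

/* what the optimizer may legally emit: the load of a[0] is
   hoisted into a register once, so if a[0] was 0 on entry
   the loop condition can never change */
int t = a[0];
while (t == 0)
    ;
```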

Now, you could use shared memory in your code snippet, since you’re only communicating within a block. Also, you can definitely get the behavior you want by using the volatile keyword to force rereading a value that’s just been read (I had something along the lines of your code working).

Paulius

#include <stdio.h>

#define N 3333

__global__ static void lockman(int *b)
{
    extern __shared__ int sh[];

    int /*volatile*/ *a = sh;       /* note: volatile is commented out here */

    int thid = threadIdx.x;

    if (thid == 0) {                /* thread 0 initializes shared memory */
        for (int i = 0; i < N; i++)
            a[i] = 0;
        a[N - 1] = 1;               /* the sentinel */
    }
    __syncthreads();

    if (thid & 31) return;          /* keep only lane 0 of each warp */

    if (thid) {                     /* thread 32: scan for the sentinel */
        int i;
        for (i = 0; a[i] == 0; i++)
            ;
        a[0] = i;
    } else {                        /* thread 0: spin on a[0] */
        while (a[0] == 0)
            ;
        b[0] = a[0];
    }
}

int main()
{
    int *da;
    int ret[2] = {-1, -1};

    cudaMalloc((void **)&da, sizeof(int) * N);
    cudaMemset(da, 0, sizeof(int) * N);
    cudaMemset(da + N - 1, 1, sizeof(int));

    lockman<<<1, 64, N * 4>>>(da);  /* N*4 bytes of dynamic shared memory */

    cudaMemcpy(ret, da, sizeof(int) * 2, cudaMemcpyDeviceToHost);
    cudaFree(da);

    printf("%d == %d\n", ret[0], N - 1);
    return 0;
}

Shared memory gets you nowhere, pal…

Only volatile shared memory does the trick.
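The fix is a one-word change to the kernel above: uncomment volatile on the shared-memory pointer, so every access to a[i] and a[0] really goes back to shared memory instead of a cached register. A minimal sketch of the kernel with that change (same logic as above; only the pointer declaration differs):

```cuda
__global__ static void lockman(int *b)
{
    extern __shared__ int sh[];
    volatile int *a = sh;           /* volatile: every read/write hits shared memory */

    int thid = threadIdx.x;
    if (thid == 0) {                /* thread 0 initializes shared memory */
        for (int i = 0; i < N; i++)
            a[i] = 0;
        a[N - 1] = 1;               /* the sentinel */
    }
    __syncthreads();
    if (thid & 31) return;          /* keep only lane 0 of each warp */

    if (thid) {                     /* thread 32: scan for the sentinel */
        int i;
        for (i = 0; a[i] == 0; i++)
            ;
        a[0] = i;
    } else {                        /* thread 0: the spin now re-reads a[0] */
        while (a[0] == 0)
            ;
        b[0] = a[0];
    }
}
```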

Also, the shared memory version seems to reproduce another, unrelated problem.
After it errors out, if I open an OpenGL program that renders a 100k+ triangle mesh, my machine hangs briefly every few frames. I’m using WinXP and CUDA 1.0 with a WinFast 8800 GTX.
I can’t tell whether it’s my card’s problem or CUDA’s problem; would anyone else please kindly test it?

The same volatile approach should work with global memory too, but it would be much slower due to global-memory latency.
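For completeness, the same one-word change applied to the original global-memory kernel would look like this (a sketch, untested here; the volatile qualifier on the parameter forces each read of a[0] to go to global memory):

```cuda
__global__ static void lockman(volatile int *a)
{
    int thid = threadIdx.x;
    if (thid & 31) return;          /* keep only lane 0 of each warp */

    if (thid) {                     /* thread 32 */
        int i;
        for (i = 0; a[i] == 0; i++) /* each a[i] read goes to global memory */
            ;
        a[0] = i;
    } else {                        /* thread 0 */
        while (a[0] == 0)           /* the spin re-reads a[0] every pass */
            ;
        a[1] = a[0];
    }
}
```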

Paulius