Lock causes infinite loop (bug report)

#include <stdio.h>

#define N 50000

/* Two threads in two different warps: thread 32 scans the array for the
   nonzero sentinel and publishes its index in a[0]; thread 0 spins on
   a[0] (the lock) and then copies the result to a[1]. */
__global__ static void lockman(int *a)
{
    int thid = threadIdx.x;

    if (thid & 31) return;          /* keep only lane 0 of each warp */

    if (thid) {                     /* thread 32 */
        int i;
        for (i = 0; a[i] == 0; i++) /* scan until the sentinel */
            ;
        a[0] = i;                   /* release thread 0 */
    } else {                        /* thread 0 */
        while (a[0] == 0)           /* spin until a[0] becomes nonzero */
            ;
        a[1] = a[0];
    }
}

int main()
{
    int *da;
    int ret[2] = {-1, -1};

    cudaMalloc((void **)&da, sizeof(int) * N);
    cudaMemset(da, 0, sizeof(int) * N);
    /* cudaMemset works per byte, so a[N-1] becomes 0x01010101;
       the kernel only tests for nonzero, so that is fine here */
    cudaMemset(da + N - 1, 1, sizeof(int));

    lockman<<<1, 64, 0>>>(da);

    cudaMemcpy(ret, da, sizeof(int) * 2, cudaMemcpyDeviceToHost);
    cudaFree(da);

    printf("%d == %d\n", ret[0], N - 1);
    printf("%d == %d\n", ret[1], N - 1);
    return 0;
}

This code should terminate, but the compiler produces PTX containing an infinite loop.

Those are the two lines I would look into. A change to the data seems required, but you have a semicolon right after each loop, so the loop body does nothing.

I think it could also be the 1,64,0. Doesn’t that refer to the x, y, z size? If x is 1, then the only thid that gets processed is 0, which puts you into an infinite while loop.

Those semicolons are obviously intentional. The one in the for loop is used to delay the release of thread 0; the one in the while loop is the lock.

Anyone who reads the docs carefully enough would know that 1,64,0 means 1 block, 64 threads, and 0 bytes of shared memory.

64 threads combined with if(thid&31)return; leave two active threads in two different warps, which makes the lock non-trivial.

I would check the while loop in the PTX to make sure that it’s still there, and if so, that it really reads a[0] on every iteration. It’s possible (and likely) that the compiler generates code that reads the value of a[0] into a register once and then reuses it.

Before people rush to judgment that such an optimization is incorrect, I’d like to point out that it is perfectly consistent with how global memory reads and writes are defined. One SHOULD NOT rely on global memory for inter-block communication.
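Concretely, without volatile the compiler is free to transform the spin into something like the following (an illustrative source-level rewrite of the fragments above, not actual generated code):

```cuda
/* what the programmer wrote */
while (a[0] == 0)
    ;

/* what the optimizer may legally emit: the load of a[0] is
   hoisted into a register once, so if a[0] was 0 on entry
   the loop condition can never change */
int t = a[0];
while (t == 0)
    ;
```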

Now, you could use shared memory in your code snippet, since you’re only communicating within a block. Also, you can definitely get the behavior you want by using the volatile keyword to force rereading a value that’s just been read (I had something along the lines of your code working).

Paulius

#include <stdio.h>

#define N 3333

__global__ static void lockman(int *b)
{
    extern __shared__ int sh[];

    int /*volatile*/ *a = sh;       /* note: volatile is commented out here */

    int thid = threadIdx.x;

    if (thid == 0) {                /* thread 0 initializes shared memory */
        for (int i = 0; i < N; i++)
            a[i] = 0;
        a[N - 1] = 1;               /* the sentinel */
    }
    __syncthreads();

    if (thid & 31) return;          /* keep only lane 0 of each warp */

    if (thid) {                     /* thread 32: scan for the sentinel */
        int i;
        for (i = 0; a[i] == 0; i++)
            ;
        a[0] = i;
    } else {                        /* thread 0: spin on a[0] */
        while (a[0] == 0)
            ;
        b[0] = a[0];
    }
}

int main()
{
    int *da;
    int ret[2] = {-1, -1};

    cudaMalloc((void **)&da, sizeof(int) * N);
    cudaMemset(da, 0, sizeof(int) * N);
    cudaMemset(da + N - 1, 1, sizeof(int));

    lockman<<<1, 64, N * 4>>>(da);  /* N*4 bytes of dynamic shared memory */

    cudaMemcpy(ret, da, sizeof(int) * 2, cudaMemcpyDeviceToHost);
    cudaFree(da);

    printf("%d == %d\n", ret[0], N - 1);
    return 0;
}

Shared memory gets you nowhere, pal…

Only volatile shared memory does the trick.
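The fix is a one-word change to the kernel above: uncomment volatile on the shared-memory pointer, so every access to a[i] and a[0] really goes back to shared memory instead of a cached register. A minimal sketch of the kernel with that change (same logic as above; only the pointer declaration differs):

```cuda
__global__ static void lockman(int *b)
{
    extern __shared__ int sh[];
    volatile int *a = sh;           /* volatile: every read/write hits shared memory */

    int thid = threadIdx.x;
    if (thid == 0) {                /* thread 0 initializes shared memory */
        for (int i = 0; i < N; i++)
            a[i] = 0;
        a[N - 1] = 1;               /* the sentinel */
    }
    __syncthreads();
    if (thid & 31) return;          /* keep only lane 0 of each warp */

    if (thid) {                     /* thread 32: scan for the sentinel */
        int i;
        for (i = 0; a[i] == 0; i++)
            ;
        a[0] = i;
    } else {                        /* thread 0: the spin now re-reads a[0] */
        while (a[0] == 0)
            ;
        b[0] = a[0];
    }
}
```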

Also, the shared memory version seems to reproduce another, unrelated problem.
After it errors out, if I open an OpenGL program that renders a 100k+ triangle mesh, my machine hangs briefly every few frames. I’m using WinXP and CUDA 1.0 with a WinFast 8800 GTX.
I can’t tell whether it’s my card’s problem or CUDA’s problem; would anyone else please kindly test it?

The same volatile approach should work with global memory too, but it would be much slower due to global-memory latency.
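For completeness, the same one-word change applied to the original global-memory kernel would look like this (a sketch, untested here; the volatile qualifier on the parameter forces each read of a[0] to go to global memory):

```cuda
__global__ static void lockman(volatile int *a)
{
    int thid = threadIdx.x;
    if (thid & 31) return;          /* keep only lane 0 of each warp */

    if (thid) {                     /* thread 32 */
        int i;
        for (i = 0; a[i] == 0; i++) /* each a[i] read goes to global memory */
            ;
        a[0] = i;
    } else {                        /* thread 0 */
        while (a[0] == 0)           /* the spin re-reads a[0] every pass */
            ;
        a[1] = a[0];
    }
}
```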

Paulius