Shared mem atomics Help needed to Fix this hang

I have been battling with atomics for past 2 days. I am trying to implement a simple thread-level spinlock on CUDA and am hitting roadblocks.

So, Here is the crux of my issue. Testing this requires compute 1.2 hardware (as it uses smem atomics).

The following code hangs plain on TESLA C1060. The kernel is called with 1 block, 32 threads.

ANd, if I replace the shared memory lock with “g_lock” – it all works fine.

My guess : The “totalParticipation” shared memory variable is NOT seen correctly by threads although it is volatile and proper “threadfence” and “__syncthreads” are done to avoid compiler optimization.

However, if I use g_lock, everything works.

Can some1 confirm this on your hardware and tell me what the solution is???

Can some1 tell me what is the mistake I am making in this code?

#include <stdio.h>

__device__ int result = -123;

__device__ int g_lock=-1;

__global__ void checkKernel(void)

{

	__shared__ int sharedLock;

	__shared__ volatile int totalParticipation;

	if (threadIdx.x == 0)

	{

		atomicExch(&sharedLock,-1);

		totalParticipation = 0;

		__threadfence();

	}

	__syncthreads();

	

	for(int i=0; totalParticipation<5; i++)

	{

		if(atomicCAS(&sharedLock, -1, (int)threadIdx.x) == -1)

		{

			totalParticipation++;

			__threadfence();

			atomicExch(&sharedLock, -1);

		}

		__threadfence();

		__syncthreads();

	}

	

	if (threadIdx.x == 0)

	{

		result = totalParticipation;

	}

	 

	 __syncthreads();

	 return;

}

int main()

{

	cudaError_t err;

	int host=-123;

	checkKernel<<< 1, 32>>>();

	err = cudaThreadSynchronize();

	if (err != 0)

	{

		printf("Error launching kernel, err = %d\n",err);

		return -1;

	}

	

	cudaMemcpyFromSymbol(&host, "result", sizeof(int), 0, cudaMemcpyDeviceToHost);

	printf("host = %d\n", host);		

}

Hi Sarnath,

All of your __threadfence() are not necessary because you’ve used __syncthreads().

I replaced [font=“Courier New”]atomicExch(&sharedLock, -1)[/font] within the loop by the simple [font=“Courier New”]sharedLock = -1[/font] and then it could work. However, I am still unable to explain why your original code cannot work despite my effort to analyze PTX code. I suspect that the atomic function on shared memory called by a sub-warp does not work properly.

I replaced atomicExch by atomicOr and got the same error.

May be I’ve got one more weird rule of thumb: never use atomic functions in conditional paths <img src=‘http://hqnveipbwb20/public/style_emoticons/<#EMO_DIR#>/crying.gif’ class=‘bbc_emoticon’ alt=’:’(’ />

Never depend on intra-warp scheduling…

Do you mean we should never serialize the warp?

You can usually serialize warps without depending on scheduling–this is one of the cases where you can’t.

Thanks for your answers. But I think the code above does NOT assume anything on warp-divergence or sub-warp scheduling.

As cvn says, the use of shared_mem atomics() inside the serial path might be a cause of an issue.

If you could confirm this …, it would be really nice and yeah, if you could add it to the documentation it would be nice…

I will work on the spinlock code with your piece of advice.

Thanks,

Best Regards,

Sarnath

meaning what? That you cannot put an atomic inside an if statement!?

You know what serial path is, but compiler or device see no distinction between such serial path or anything else. It is just one thread processing part of code, and that can also occur if you write ‘if (threadIdx.x==0)’ !

If you hadn’t put __syncthreads() inside your for loop, I could suspect simmilar thing to the case I described in the other thread, but the way it is written now, I don’t see why it is hanging. But it is on my machine as well!

I added __syncthreads() at beginning of the for loop to be on the safe side - same result.

P.S. I just ended up with a blue screen while trying to evaluating this problem. I’ll have a stop now, sorry…

Dear Sarnath,

From my understanding your problem is like follows, isn’t it:

  • Letting the thread constellation do a job called [font=“Courier New”]processing_job()[/font] in parallel.

  • Once a thread finishes [font=“Courier New”]processing_job()[/font], it needs to do another job called [font=“Courier New”]reporting_job()[/font]. However, there is a restriction that there should be no more than one thread doing this job at the same time.

  • To achieve maximum speed-up, scheduling the job [font=“Courier New”]reporting_job()[/font] should be in FIFO-style. Those threads finishing [font=“Courier New”]processing_job()[/font] first should proceed to [font=“Courier New”]reporting_job()[/font] as soon as possible.

If your purpose is as above, the following solution may be applicable. Note that all threads within a warp MUST finish [font=“Courier New”]processing_job()[/font] at the SAME time, so it is not necessary to spinlock at thread level.

processing_job();

block_spinlock_loop();

warp_spinlock_loop();

// explicit intra-warp scheduling

for (k = 0; k < warp_size; k++)

	if (k == (threadIdx.x % 32))	  // 1-D block assumed

		reporting_job();

unlock_warp_spinlock();

syncthreads();

unlock_block_spinlock();

Finally I have figured out why atomic functions in Sarnath’s spinlock programs sometimes do not work properly: too many atomic instructions concurrently applied to the same memory location make the GPU muddled. A relaxation of this heavy scheduling will be a good WORKAROUND.

The following code is an example for your spinlock, Sarnath. On my GTX260 card, it takes about 20 sec for 65535x512 threads. Note that this is just a WORKAROUND as depending on intra-warp scheduling is dangerous (according to tmurray). A safe implementation would be based on hardware-communications between multiprocessors, which are not supported at the moment.

#include <stdio.h>

#include <stdlib.h>

#define MAX_WARP_SIZE 32

#define WAIT_TIME 10

__device__ volatile int lock = -1;

__device__ volatile int counter = 0;

__global__ void spinlol()

{

	__shared__ volatile int intraCTAlock;

	int k, tmp, warp_size;

	warp_size = blockDim.x - ((threadIdx.x / MAX_WARP_SIZE) * MAX_WARP_SIZE);

	if (warp_size > MAX_WARP_SIZE)

		warp_size = MAX_WARP_SIZE;

	// Block-level spinlock

	if (!threadIdx.x)

	{

BlockSpin:

		if (lock != -1)

		{

			// NOPs

			for (k = 0, tmp = 0; k < WAIT_TIME; k++)

				tmp += k;

			goto BlockSpin;

		}

		if (atomicCAS((int*)&lock, -1, blockIdx.x) != -1)

			goto BlockSpin;

	}

	__syncthreads();

	// Warp-level spinlock

	if (!threadIdx.x)

		intraCTAlock = -1;

	__syncthreads();

	if (!(threadIdx.x % MAX_WARP_SIZE))

	{

WarpSpin:

		if (intraCTAlock != -1)

			goto WarpSpin;

		if (atomicCAS((int*)&intraCTAlock, -1, threadIdx.x) != -1)

			goto  WarpSpin;

	}

#if (0)

	if (!(threadIdx.x % MAX_WARP_SIZE))

		counter += warp_size;

	__threadfence_block();

#else

#if (0)

	k = 0;

LoopStart:

	if (k != (threadIdx.x % MAX_WARP_SIZE))

		goto LoopCheck;

	counter++;

LoopCheck:

	__threadfence_block();

	if (++k < warp_size)

		goto LoopStart;

#else

	for (k = 0; k < warp_size; k++)

	{

		if (k == (threadIdx.x % MAX_WARP_SIZE))

			counter++;

		__threadfence_block();

	}

#endif

#endif

	// Unlock warp-level spinlock

	if (!(threadIdx.x % MAX_WARP_SIZE))

	{

		//atomicExch((int*)&intraCTAlock, -1);

		atomicCAS((int*)&intraCTAlock, threadIdx.x, -1);

		//intraCTAlock = -1;

	}

	// Unlock block-level spinlock

	__syncthreads();

	__threadfence();

	if (!threadIdx.x)

		atomicExch((int*)&lock, -1);

	__threadfence();

}

int main(int argc, char** argv)

{

	int hostcounter = -1;

	dim3 grid, block;

	if (argc != 3)

	{

		printf("Usage: <executable-name> N_Blocks N_Threads\n");

		return -1;

	}

	grid.x = atoi(argv[1]);

	block.x = atoi(argv[2]);

	cudaSetDevice(0);

	printf("Spawning kernel with %d blocks, %d threads_per_blocks\n", grid.x, block.x);

	spinlol<<<grid, block>>>();

	cudaThreadSynchronize();

	printf("err = %s\n", cudaGetErrorString(cudaGetLastError()));

	cudaMemcpyFromSymbol(&hostcounter, "counter", sizeof(int), 0, cudaMemcpyDeviceToHost);

	printf("counter = %d\n", hostcounter);

}

Dear CVN,

Amazing work… THat too, u claim it finishes so fast. Absolutely brilliant!

I will check this on Monday when I reach office.

A BIG THANKS!!!

@PDan,
Thanks for testing this out! Truly appreciate your time on this!

Best Regards,
Sarnath

Hmm… This one affects only shared atomics… If I replace the shared lock with a global lock, everything works… So, there is definitely something wrong.

@Tim,

The program I have posted in first post – has no dependence on warp scheduling or anything. The __syncthreads() ensures that warps converge during each iteration of the FOR looop.

This is totally a new issue pertaining to shared mem atomics. If you could give your feedback, it would be useful.

Even if you do that, the spinlock loop will still hang for large thread constellation.

cvn,
That hang is just slowness and not a deadlock – which is acceptable in some sense (atleast for theoretical purpose). Ultimately, we need a fast solution like what u have posted (for practical purpose).

but that said,
we really see that shared mem atomics is causing some problem and this has nothing to do with warp-divergence or scheduling. And, I would like Tim to address this point – because if this is a bug, an errata needs to be published in the manual.

Thanks for your time,