Memory lock with compute capability 1.0: Lamport's bakery algorithm on the GPU

Hello,

I’ve found an old thread where someone unsuccessfully tried to implement Lamport’s bakery algorithm on the GPU. GregD posted code that he said works in emulation mode. I’ve written a similar implementation and indeed, it works very well in emulation mode. Unfortunately, it doesn’t work on the GPU, neither in shared nor in global memory. I understand why it doesn’t work in shared memory, but I don’t see why the global memory implementation fails. Is there some instruction reordering going on that prevents the code from working?

I’ve also tried declaring the pointers as [font=“Lucida Console”]volatile[/font]. As a result, the code seems to hang indefinitely (again, only on the device; it works fine in emulation mode). I suspect that [font=“Lucida Console”]volatile[/font] is simply ignored in some contexts. Is that possible?

All I really need is an [font=“Lucida Console”]atomicInc[/font] operation for devices with compute capability 1.0.
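To make the goal concrete, this is the kind of wrapper I’d like to end up with. It’s only a minimal sketch built on the [font=“Lucida Console”]enter_critical_section[/font]/[font=“Lucida Console”]leave_critical_section[/font] functions shown below; the name [font=“Lucida Console”]bakery_atomic_inc[/font] is mine and not part of CUDA:

// Hypothetical atomicInc replacement for compute capability 1.0, built on the
// bakery lock defined further down. Like the real atomics, it returns the value
// the counter held before the increment.
__device__ unsigned int bakery_atomic_inc(mutex_t m, unsigned int* counter) {
	enter_critical_section(m);
	unsigned int const old = *counter;	// read-modify-write while holding the lock
	*counter = old + 1;
	leave_critical_section(m);
	return old;
}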

The relevant code looks as follows:

// Why use a union? So we can conveniently store all relevant data in one array
// without the use of another struct wrapping the pointer.
union mutex_info {
	struct mutex_thread_data {
		unsigned int priority;	// the thread's "ticket number"
		bool choosing;	// true while the thread is picking its ticket
	} data;
	unsigned int num_threads;	// only used in element 0 of the array
};

typedef volatile mutex_info* mutex_t;

__device__ void init_mutex_device(mutex_t* m, unsigned int const num_threads) {
	unsigned int const tid = threadIdx.x + 1;	// per-thread data starts at index 1
	(*m)->num_threads = num_threads;	// element 0 stores the thread count
	(*m)[tid].data.choosing = false;
	(*m)[tid].data.priority = 0;
	__syncthreads();	// make sure all slots are initialised before anyone takes the lock
}

__device__ void enter_critical_section(mutex_t m) {
	unsigned int const tid = threadIdx.x + 1;
	volatile mutex_info::mutex_thread_data& mm = m[tid].data;
	mm.choosing = true;
	while (not mm.choosing) { }	// spin until the store above reads back as true
	// Take a ticket larger than every ticket currently held.
	for (unsigned int i = 1; i < m->num_threads + 1; ++i)
		if (m[i].data.priority > mm.priority)
			mm.priority = m[i].data.priority;
	++mm.priority;
	mm.choosing = false;
	// Wait until every thread with a smaller ticket (or an equal ticket and a
	// lower thread id) has left the critical section.
	for (unsigned int i = 1; i < m->num_threads + 1; ++i) {
		while (m[i].data.choosing) { }
		while (
			m[i].data.priority != 0 and
			(m[i].data.priority < mm.priority or
			 m[i].data.priority == mm.priority and i < tid)
		) { }
	}
}

__device__ void leave_critical_section(mutex_t m) {
	m[threadIdx.x + 1].data.priority = 0;	// give up the ticket
}

(For alignment reasons in shared memory, the struct is wrapped in a union and the number of threads is stored in the first element of the [font=“Lucida Console”]mutex_info[/font] array. This isn’t relevant for global memory; it just means that to access thread x's data we have to use index x+1.)
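To make the indexing concrete, this is roughly how I set the array up on the host. It’s a sketch with made-up names; [font=“Lucida Console”]NUM_T[/font] is the number of threads:

// Host-side allocation sketch: one extra element at index 0 for num_threads,
// then one mutex_thread_data slot per thread at indices 1..NUM_T.
mutex_info* d_mutex = 0;
cudaMalloc((void**)&d_mutex, (NUM_T + 1) * sizeof(mutex_info));
cudaMemset(d_mutex, 0, (NUM_T + 1) * sizeof(mutex_info));	// zero all choosing/priority slots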

For reference, there is clean, well-described pseudocode of the algorithm available.

A small example of using this code:

__global__ void minitest(
	unsigned int* counts,
	unsigned int* output,
	unsigned int* pout_idx,
	unsigned int const NUM_T,
	volatile mutex_info* mutex
) {
	unsigned int const x = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int const N = counts[x];
	init_mutex_device(&mutex, NUM_T);
	for (unsigned int i = 0; i < N; ++i) {
		enter_critical_section(mutex);
		unsigned int idx = (*pout_idx)++;	// hand out the next output slot
		leave_critical_section(mutex);
		output[idx] = x;
	}
}
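For completeness, I launch it roughly like this (a sketch with the made-up names from above; a single block only, since the lock indexes by [font=“Lucida Console”]threadIdx.x[/font]):

// Launch sketch: one block of NUM_T threads; d_counts, d_output, d_out_idx and
// d_mutex are device allocations set up beforehand.
minitest<<<1, NUM_T>>>(d_counts, d_output, d_out_idx, NUM_T, d_mutex);
cudaThreadSynchronize();	// wait for the kernel to finish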

/EDIT: Added a barrier as per prkipfer’s remark in the other thread to counter the fire-and-forget behaviour. However, this doesn’t have any impact on the problem.