Deadlock that shouldn't occur

Hi everyone,

I'm puzzled by an infinite loop that occurs for a single thread every couple of times I run the kernel shown below. Let me start off with what I am trying to do: I am implementing a cuckoo hashing scheme in CUDA that uses a waterfall model to place the elements of one bucket into several hash tables, i.e. an element that was already placed into a spot might get displaced later on. If that happens, the thread monitoring the displaced element tries to find a spot in the next hash table. By doing so it might displace another element, which then has to find a new location in the next table, and so on…

The next table index is chosen in a round-robin fashion, so elements that were displaced in the last hash table try to find a location in the first one again.

Everything works fine, but from time to time a single thread runs into an endless loop, which shouldn't be possible (at least to my knowledge). Here is the cuckoo hashing function:

[codebox]__device__ void cuckooHashElement(
	const int2& element, int2 cuckooTables[][CuckooTableSize],
	const int2* hashFunctions, uint& cycleCounter, bool& conflictsExist,
	const uint& maxCuckooIterations)
{
	uint lastTableIndex = 1; // doesn't matter as long as it is not equal to currentTableIndex.
	uint currentTableIndex = 0;
	uint hashedValue;
	do
	{
		__syncthreads();
		if (threadIdx.x == 0)
		{
			// reset ready flag.
			conflictsExist = false;
		}
		// if we need to check a new table -> write into the hashed position
		// (this position may be written by many others as well).
		if (lastTableIndex != currentTableIndex)
		{
			hashedValue = hashObject(hashFunctions[currentTableIndex], element.x, CuckooTableSize);
			cuckooTables[currentTableIndex][hashedValue] = element;
			lastTableIndex = currentTableIndex;
		}
		// sync threads so that we can check which thread won the write competition.
		__syncthreads();
		if (cuckooTables[currentTableIndex][hashedValue].x != element.x)
		{
			// I lost, hence update block flag and set next table to use.
			conflictsExist = true;
			currentTableIndex++;
			// TODO: replace that with a faster method (modulo is pretty
			// expensive on the GPU); use bitmask & if (in case of 3)!
			currentTableIndex %= 3;
		}
		if (threadIdx.x == 0)
		{
			cycleCounter++;
		}
		__syncthreads();
	} while (conflictsExist && cycleCounter < maxCuckooIterations);
}
[/codebox]
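The TODO in the code above mentions replacing the modulo with something cheaper. As a minimal sketch (the helper name is illustrative, not from the original code), the round-robin advance over three tables can be done with a compare-and-reset instead of `%`:

```cpp
// Hypothetical helper: advance a table index round-robin over 3 tables
// without the % operator. Assumes the index is always in [0, 2].
unsigned int nextTableIndex(unsigned int currentTableIndex)
{
    currentTableIndex++;
    // Compare-and-reset instead of modulo; integer % can compile to a
    // comparatively slow instruction sequence on the GPU.
    if (currentTableIndex == 3)
        currentTableIndex = 0;
    return currentTableIndex;
}
```

Whether the branch actually beats the modulo depends on the hardware and what the compiler emits, so it is worth checking the generated code.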

It might happen that no configuration is found within maxCuckooIterations in which every element has a unique spot. To put it another way: there may be a cycle of exactly #cuckooTableCount + 1 elements that share the same positions in all #cuckooTableCount tables, so no valid configuration exists with these hash functions. In this case there are 3 cuckooTables. If that happens, the hash functions are reinitialised and this method is called again.
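The "no valid configuration" case is just the pigeonhole principle: a group of conflicting elements can occupy at most one slot per table. A tiny illustrative check (not from the original code) makes the bound explicit:

```cpp
// Illustrative pigeonhole check: with tableCount tables, a group of
// elements that all hash to the same slot in every table has exactly
// tableCount usable slots, so at most tableCount of them can be placed.
// A group of tableCount + 1 such elements can never be fully placed.
bool placeable(int tableCount, int conflictingElements)
{
    return conflictingElements <= tableCount;
}
```

With 3 tables, 3 fully-conflicting elements still fit, but a 4th one forces an endless chain of evictions, which is exactly what maxCuckooIterations cuts off.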

[codebox]for (; hashFuncInitCounter[0] < maxHashFunctionGenerations; hashFuncInitCounter[0]++)
{
	cuckooHashElement(element, cuckooTables, hashFunctions,
		cycleCounter[0], conflictsExist[0], MaxCuckooIterations);
	if (conflictsExist[0])
	{
		if (threadIdx.x < CuckooTableSize)
		{
			for (uint i = 0; i < 3; i++)
				cuckooTables[i][threadIdx.x] = emptyElement;
		}
		if (threadIdx.x == 0)
		{ // reinitialise the hash functions and retry.
			cycleCounter[0] = 0;
			hashFuncInitCounter[0]++;
			conflictsExist[0] = false;
			initHashFunctions(hashFunctions, d_magicNumbers);
		}
		__syncthreads();
	}
	else
		break;
} // for (; hashFuncInitCounter[0] < maxHashFunctionGenerations; ...)
[/codebox]

Now to the actual error:

It might happen that a single thread gets stuck in the cuckooHashElement function although all the others have finished. Even stranger: the 4 times this has happened, it was always the thread with id 2 (in different blocks).

I have no idea how this can possibly happen. Do you guys?

Thanks for any help! :-)

Marius

Not sure if it is important, but you have a write-after-read hazard in your second code snippet. You have:

if (conflictsExist[0])  <-- read
{
   ...
  if (threadIdx.x == 0)
  {
	...
	conflictsExist[0] = false;  <-- write
  }
  __syncthreads(); <-- barrier that may not be reached by all threads
}

A barrier that is not reached by all threads may cause other threads to stop at different barriers (this behaviour is not supported; it just happens to work like that on current hardware). As a result, threads may continue to stop at different barriers, one ahead of the others.
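A minimal sketch of the pattern described above (an illustrative kernel, not the original code):

```cuda
__global__ void divergentBarrier(int* flag)
{
    if (*flag)            // read
    {
        if (threadIdx.x == 0)
            *flag = 0;    // write-after-read hazard on the same location

        // Only threads that saw *flag != 0 reach this barrier. If any
        // thread of the block skipped the outer if, the barrier is not
        // reached by the whole block and the behaviour is undefined:
        // threads may end up synchronising with *different* barriers
        // later in the kernel, one iteration ahead of the others.
        __syncthreads();
    }
}
```

The safe variants are either to make the condition uniform across the block or to hoist the `__syncthreads()` out of the divergent branch.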

The ocelot emulator and possibly the nvidia emulation mode (-emudebug) will detect this kind of error if it exists in your program.

Hey guys,

thank you for your hints!

Unfortunately, the write-after-read hazard isn't the cause of the problem (though it was another potential problem). Over the last few days I tried a couple of different approaches: rewriting the method itself, replacing the shared memory with global memory, and debugging the whole thing via Nexus on different Tesla cards (Tesla C1060s).

None of these approaches really cleared my mind on the exact root cause of my problem. I am down to a minimal test case where I have only 5 elements hashed into one bucket, with initially fixed secondary hash functions. The size of each cuckooTable is limited to 2. In this mini scenario I found the following:

If the element of thread 0 is successfully written on the first try (cuckoo table 0), then the following happens:

    If I am not debugging -> endless loop, where the first d threads wait at the __syncthreads() after the while (cycleCounter < MaxCuckooIterations) loop, and the other threads with threadIdx.x > d loop endlessly through the while loop.

    If I am debugging and watching the thread with index 0, I see that this thread is always successful in writing to the shared cuckooTables[0][1]; the other threads realise they weren't successful and set conflictsExist = true, hence all threads iterate a second time through the loop.

    In the second iteration, thread 0 skips the writing phase and NO other thread writes to the cuckooTables!! (checked after stepping over the next __syncthreads). This is something I really don't understand, but it gets even weirder: of course, no other thread set the loop criterion, BUT only thread 0 and maybe 1 or 2 other threads leave the loop; the rest stay in the while loop! Does anybody have any idea why this may even be happening? Shouldn't the rest of the warp skip the loop along with thread 0 when the abort criterion is set and published (via __syncthreads) to the rest of the block? Now to the weirdest part:

    If I am debugging and already in the 2nd iteration, but watching some other thread that didn't succeed in the first round, then I see how other elements are written into the cuckooTables (which I didn't see when following thread 0). But thread 0 has already broken out of the loop, hence the abort flags aren't set anymore and the other threads loop endlessly.

So my question boils down to: how can some threads of the same warp (there is only one) edit a shared memory location, run into a sync barrier, and the other threads still not see these changes?

Does the compiler apply some optimisation that prevents some threads from seeing the update of these flags?

I am baffled, since I thought that __syncthreads waits until all threads of the block have reached it and makes all shared memory writes visible, according to the CUDA Programming Guide, section B.6.

But in my case it obviously doesn't.

Any ideas?

P.S. If some admin would be so kind as to move this thread to “CUDA Programming and Development” sub forum. I opened that thread in the wrong forum, so sorry about that! O:)

Updated code of the critical kernel:

[codebox]__global__ void cuckooHash(
	const int2* bucketedObjects, uint elementCount,
	int2* hashTable, uint bucketCount,
	const uint* bucketItemCount, const uint* bucketStart,
	int* failedFlag)
{
	__shared__ uint hashFuncInitCounterAr[1];
	__shared__ uint cycleCounterAr[1];
	__shared__ bool conflictsExistAr[1];
	__shared__ int2 hashFunctions[3];
	uint& hashFuncInitCounter = hashFuncInitCounterAr[0];
	uint& cycleCounter = cycleCounterAr[0];
	bool& conflictsExist = conflictsExistAr[0];
	// only the first thread is allowed to write the shared information.
	if (threadIdx.x == 0)
	{
		cycleCounter = 0;
		hashFuncInitCounter = 0;
		conflictsExist = false;
		for (uint i = 0; i < 3; i++)
			hashFunctions[i] = d_defaultHashFunctions[i];
	}
	// FIXME: set the size of the tables dynamically according to cuckooTableSize (to do this dynamically I need to
	// reserve the total amount of shared memory needed prior to calling the kernel and then pass this value as the
	// third argument in <<< >>>).
	__shared__ int2 cuckooTables[3][CuckooTableSize];
	// clear the tables, otherwise stale information might still be in the memory.
	int2 emptyElement = make_int2(0, 0);
	if (threadIdx.x < CuckooTableSize)
	{
		for (uint i = 0; i < 3; i++)
			cuckooTables[i][threadIdx.x] = emptyElement;
	}
	// all threads wait until the first one has set the necessary data.
	__syncthreads();
	// only if this thread is trying to hash an object that really belongs to the current bucket.
	if (threadIdx.x < bucketItemCount[blockIdx.x])
	{
		// get the element for this thread. (blockIdx.x = bucketNumber, threadIdx.x = offset)
		int2 element = bucketedObjects[bucketStart[blockIdx.x] + threadIdx.x];
		for (; hashFuncInitCounter < maxHashFunctionGenerations; hashFuncInitCounter++)
		{
			//cuckooHashElement(&element, cuckooTables, hashFunctions,
			//	cycleCounter, conflictsExist, MaxCuckooIterations);
			// TODO: remove when working correctly!
			uint lastTableIndex = 1; // doesn't matter as long as it is not equal to currentTableIndex.
			uint currentTableIndex = 0;
			uint hashedValue;
			while (cycleCounter < MaxCuckooIterations)
			{
				// reset ready flag.
				if (threadIdx.x == 0)
				{
					cycleCounter++;
					conflictsExist = false;
				}
				__syncthreads();
				// if we need to check a new table -> write into the hashed position
				// (this position may be written by many others as well).
				if (lastTableIndex != currentTableIndex)
				{
					hashedValue = hashObject(hashFunctions[currentTableIndex], element.x, CuckooTableSize);
					cuckooTables[currentTableIndex][hashedValue] = element;
					lastTableIndex = currentTableIndex;
				}
				// sync threads so that we can check which thread won the write competition.
				__syncthreads();
				int2 successfullElement = cuckooTables[currentTableIndex][hashedValue];
				if (successfullElement.x != element.x || successfullElement.y != element.y)
				{
					// I lost, hence update block flag and set next table to use.
					conflictsExist = true;
					currentTableIndex++;
					// TODO: replace that with a faster method (modulo is pretty
					// expensive on the GPU); use bitmask & if (in case of 3)!
					currentTableIndex %= 3;
				}
				__syncthreads();
				if (!conflictsExist)
				{
					break;
				}
			} // while (cycleCounter < MaxCuckooIterations)
			__syncthreads();
			if (conflictsExist)
			{
				if (threadIdx.x < CuckooTableSize)
				{
					for (uint i = 0; i < 3; i++)
						cuckooTables[i][threadIdx.x] = emptyElement;
				}
				__syncthreads(); // to prevent write-after-read hazards with the conflictsExist flag.
				if (threadIdx.x == 0)
				{ // reinitialise the hash functions and retry.
					cycleCounter = 0;
					hashFuncInitCounter++;
					conflictsExist = false;
					initHashFunctions(hashFunctions, d_magicNumbers);
				}
				__syncthreads();
			}
			else
				break;
		} // for (; hashFuncInitCounter < maxHashFunctionGenerations; hashFuncInitCounter++)
	} // if (threadIdx.x < bucketItemCount[blockIdx.x])
	// ensure that the temporary tables are completely written before writing
	// them out. This is needed since the number of elements in one bucket
	// may be less than the size of the cuckooTables!
	__syncthreads();
	// check whether, after all tries, there are still conflicts.
	if (conflictsExist)
	{
		if (threadIdx.x == 0)
			atomicOr(failedFlag, 1);
		return;
	}
	// write the hashTables + hashFunction constants to global device memory.
	// For coalesced reads later (according to the authors), first write all first-stage cuckoo tables,
	// then the second-stage ones, etc.
	// TODO: I don't believe them -> test whether consecutive cuckoo tables are really worse, especially
	// if the query algorithm first packs together the query keys that fall into the same bucket!
	if (threadIdx.x <= CuckooTableSize)
	{
		const uint bucketOffset = blockIdx.x * (3 * CuckooTableSize + 3);
		const uint cuckooTableStart = bucketOffset + 3;
		for (uint i = 0; i < 3; i++)
		{
			// CuckooTableSize + 1 = one cuckooTable + hashFuncVals, hence up to the first '+'
			// it is the offset for writing the i-th cuckooTable.
			uint pos = cuckooTableStart + (CuckooTableSize * i) + threadIdx.x;
			if (threadIdx.x < CuckooTableSize)
				// write the elements of the cuckoo tables.
				hashTable[pos] = cuckooTables[i][threadIdx.x];
			else
				// write the values needed for the hash functions of the cuckoo tables.
				hashTable[bucketOffset + i] = hashFunctions[i];
		}
	}
}
[/codebox]

The updated example still has a problem with __syncthreads(). You can't use it in conditional code unless the condition evaluates identically in all threads of a block. The same applies to loops.
In your example, the __syncthreads() barriers are still subject to [font="Courier New"]if (threadIdx.x < …)[/font] conditions.

To be more specific, there is this huge if statement

if (threadIdx.x < bucketItemCount[blockIdx.x])

with __syncthreads() somewhere deeper inside a few more (hopefully non-divergent) branches.
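One common way to restructure this (a hedged sketch with illustrative names, not the original kernel) is to let all threads enter the region and guard only the per-element work, so every barrier is executed by the whole block:

```cuda
__global__ void kernelSketch(const unsigned int* bucketItemCount)
{
    // Record once whether this thread has real work; the flag itself is
    // per-thread, but the barriers below are no longer conditional on it.
    const bool active = (threadIdx.x < bucketItemCount[blockIdx.x]);

    if (active)
    {
        // per-element work that contains NO barriers
    }

    __syncthreads();   // reached unconditionally by all threads of the block

    if (active)
    {
        // more per-element work, again without barriers
    }
}
```

Loops follow the same rule: every thread must run the loop the same number of times (inactive threads simply skip the guarded body), so that each __syncthreads() inside the loop is matched block-wide.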