Block synchronization issue on Fermi: global memory doesn't seem to be in sync across thread blocks

So I’ve been working on this problem for a while and still can’t seem to sort it out. Basically the code works fine when only one thread block is utilized. However, if 2 or more are utilized, the data in global memory is “corrupt”. But to explain further…

Essentially (in the attached code), I’m building a hash table in shared memory, and after I’m done populating it, building an aggregated hash table in global memory. When I’m using just one thread block, everything functions fine. However, when I use two or more, the data in the hash table is incorrect, or corrupt.

I’ve been reading online and trying to figure out what the potential issue is. From my understanding, one reason this could happen is that the locks or data writes are not propagated properly. So I tried using __threadfence() and that did absolutely nothing.

I’m at my wits’ end and any help is appreciated!
src.zip (9.45 KB)

shouldn’t you be using __threadfence_block() for inter-block communication?

Well, I tried __threadfence() and __threadfence_system(). However, it didn’t seem to help. Also, when there is only one block everything works fine, so why would the problem be inter-block communication?

try

__device__ void add_to_global_hash_table(unsigned int key, unsigned int count, unsigned int sum, unsigned int sumSquared, Table table, int *globalHashLocks, int *globalFreeLock, int *globalFirstFree)

{

	// Find entry if it exists

	unsigned int hashValue = hash(key, table.count);

	

	lock(&globalHashLocks[hashValue]);

//	__threadfence(); // not necessary

	int bucketHead = table.entries[hashValue];

	int currentLocation = bucketHead;

	

	bool found = false;

	Entry currentEntry;

	while (currentLocation != -1 && !found) {

		currentEntry = table.pool[currentLocation];

		if (currentEntry.data.x == key) {

			found = true;

		} else {

			currentLocation = currentEntry.next;

		}

	}

	

	if (currentLocation == -1) {

		// If entry does not exist, create entry

		lock(globalFreeLock);

		int newLocation = (*globalFirstFree)++;

		__threadfence(); // you need to make sure that pointer globalFirstFree is updated before other threads can see it.

		

		unlock(globalFreeLock);

		

		Entry newEntry;

		newEntry.data.x = key;

		newEntry.data.y = count;

		newEntry.data.z = sum;

		newEntry.data.w = sumSquared;

		newEntry.next = bucketHead;

		// Add entry to table

		table.pool[newLocation] = newEntry;

//		__threadfence();  // not necessary

		table.entries[hashValue] = newLocation;

//		__threadfence();  // not necessary

	} else {

		currentEntry.data.y += count;

		currentEntry.data.z += sum;

		currentEntry.data.w += sumSquared;

		table.pool[currentLocation] = currentEntry;

//		__threadfence();  // not necessary

	}

	

	__threadfence(); // make sure DRAM buffer is flushed 

                   // or equivalently write of this thread is seen by other threads before unlock.

	unlock(&globalHashLocks[hashValue]);

//	__threadfence();   // not necessary

}

I modified

lock(globalFreeLock);

int newLocation = (*globalFirstFree)++;

unlock(globalFreeLock);

__threadfence();

and removed the redundant __threadfence() calls.

As far as I know, a single __threadfence() can guarantee the ordering of writes before and after the critical section.

I gave it a try with the modifications you suggested. However, it still doesn’t seem to synchronize properly.

What baffles me is this. If I run the same code on a GTS 250 (a compute 1.1 device, with all shared memory atomics removed), it works perfectly fine. The exact same code compiled for 2.0 and run on a C2050 or C2070 has a corrupt hash table. From everything I can figure, this is caused by the synchronization differences between the non-Fermi and Fermi architectures.

Does anyone have any more insight or suggestions? Is there any way to get actual help from NVIDIA developers as to what’s going on?

For example, I made the edits in the following file and compiled it for 1.1. If I run it on a compute 1.1 device, it works great (albeit slower). On a 2.0 device, the hash table is corrupt. Same number of blocks and the same runtime environment in every other way; the only difference is the device it is run on.
gpuDB_kernels.cu (4.72 KB)

Do you have any shared memory use in your kernel (sorry, but I am not downloading a zip archive to find out)? If you do, it could be a compiler optimization that is breaking something. Have a look at page 3 of the Fermi compatibility guide in the toolkit. You might need to declare shared memory as volatile inside a kernel to make sure that results are actually written to shared memory rather than kept in registers. That can break implicit intra-warp synchronization.

I compiled your code:

nvcc -Xptxas -v -arch=sm_11 gpuDB_kernels.cu

ptxas info : Compiling entry function '_Z10testKerneljP5uint2S0_' for 'sm_11'

ptxas info : Used 4 registers, 24+16 bytes smem

ptxas info : Compiling entry function '_Z18computeAggregationjP5uint25TablePiS2_S2_' for 'sm_11'

ptxas info : Used 23 registers, 8264+16 bytes smem, 20 bytes cmem[1]

You use more than 8 KB of shared memory per thread block, so only one thread block per SM can be resident on sm_11.

However, you set the thread block size to 512:

#define BLOCK_SIZE 512          // Size of the thread blocks

        dim3 threads(BLOCK_SIZE, 1);

        //dim3 grid((size + BLOCK_SIZE - 1)/(BLOCK_SIZE * STRIDE_SIZE), 1);

        dim3 grid(2, 1);

cudaFuncSetCacheConfig(computeAggregation, cudaFuncCachePreferShared);

cudaEventRecord(start, 0);

        computeAggregation<<<grid, threads>>>(size, d_idata, gpuTable, d_hashLocks, d_freeLock, d_firstFree);

How could you run this on sm_11? The resources are not enough unless the thread block size is 256.

Yes, I do have shared memory usage in my kernel, however, all of it has been commented out in the version I posted that works on a GTS250 and does not on a C2050/C2070.

In trying to find a solution to my problem, I read the compatibility guide and its advice on using “volatile” to make sure shared memory writes are visible to all threads in the warp. However, using volatile did nothing (perhaps I used it wrong), and since the code works fine with 1 block (with multiple warps) and does not with 2 or more blocks, I don’t see how shared memory could be the problem.

I know it’s a hassle to download a zip and try running it, but if you wouldn’t mind, I would really appreciate it. At this point, I’m willing to pay for support if someone can help me figure out what the issue is.

Sorry, I forgot to mention that in order to run it on the GTS 250 in compute 1.1 mode, I had to make the following changes:

gpuDB.cu: line 9

#define BLOCK_SIZE 512

to

#define BLOCK_SIZE 256

and commented out the shared memory code:

__global__ void computeAggregation(unsigned int size, uint2 *idata, Table table, int *globalHashLocks, int *globalFreeLock, int *globalFirstFree)

{

	/*

	__shared__ int sharedHashLocks[HASH_ENTRIES];

	__shared__ int sharedFreeLock;

	__shared__ int entries[HASH_ENTRIES];

	__shared__ Entry pool[HASH_SIZE];

	__shared__ int sharedFirstFree;

	*/

	uint2 tuple;

	Entry sharedEntry;

	unsigned int index, warpThread;

	unsigned int threadID = (blockIdx.x * blockDim.x) + threadIdx.x;

	unsigned int stride = (gridDim.x * blockDim.x);

	/*

	if (threadIdx.x == 0) {

		sharedFreeLock = 0;

		sharedFirstFree = 0;

	}

	__syncthreads();

	for (index = threadIdx.x; index < HASH_ENTRIES; index += blockDim.x) {

			sharedHashLocks[index] = 0;

			entries[index] = -1;

	}

	__syncthreads();

	*/

	for (index = threadID; index < size; index += stride) {

		tuple = idata[index];

		for (warpThread = 0; warpThread < warpSize; warpThread++) {

			if ((threadIdx.x % warpSize) == warpThread) {

				add_to_global_hash_table(tuple.x, 1, tuple.y, (tuple.y * tuple.y), table, globalHashLocks, globalFreeLock, globalFirstFree);

				//add_to_shared_hash_table(tuple.x, tuple.y, sharedHashLocks, &sharedFreeLock, entries, pool, &sharedFirstFree);

			}

		}

		__syncthreads();

	}

	/*

	__syncthreads();

	for (index = threadIdx.x; index < HASH_SIZE; index += blockDim.x) {

		sharedEntry = pool[index];

		for (warpThread = 0; warpThread < warpSize; warpThread++) {

			if ((threadIdx.x % warpSize) == warpThread) {

				add_to_global_hash_table(sharedEntry.data.x, sharedEntry.data.y, sharedEntry.data.z, sharedEntry.data.w, table, globalHashLocks, globalFreeLock, globalFirstFree);

			}

		}

		__syncthreads();

	}*/

}

When I compile with

nvcc -Xptxas -v -arch=sm_11 gpuDB_kernels.cu

, I get:

ptxas info    : Compiling entry function '_Z10testKerneljP5uint2S0_' for 'sm_11'

ptxas info    : Used 4 registers, 24+16 bytes smem

ptxas info    : Compiling entry function '_Z18computeAggregationjP5uint25TablePiS2_S2_' for 'sm_11'

ptxas info    : Used 23 registers, 64+16 bytes smem, 16 bytes cmem[1]

Bump! I’m trying to find a solution to this problem, does anyone know of any ways to contact CUDA developers at NVIDIA? Or any avenues of official support?

As size is not a multiple of the block size, different threads will encounter different numbers of __syncthreads() calls in the loops. Behavior in that case is undefined and usually bad things happen.

ha!

?

Synchronizing blocks is a bad idea. Copy out to host memory and merge the per-block hash tables on the CPU.

That’s what I thought first too. But the code looks pretty clean (even though inefficient) in that respect.

Try the following code:

//int bucketHead = table.entries[hashValue];

    int bucketHead = atomicCAS(&(table.entries[hashValue]), -2, 1); // table.entries[hashValue] >= -1

    int currentLocation = bucketHead;

I think this is a problem of cache coherence in L1.

If you do __threadfence(), then it guarantees L2 coherence, so the writing side is O.K.

However, if block 0 and block 1 try to update the same hashValue, suppose block 0 takes the lock and updates the hash table. The old value of table.entries[hashValue] may still be in block 1’s L1 cache. So even if block 1 reads table.entries[hashValue] after block 0 unlocks, it cannot see the latest value, because the L1 cache returns the old one (this also happens on CPUs).

Actually I had the same problem when developing the PFAC library: http://code.google.com/p/pfac/

I think that reading table.pool[currentLocation] is O.K. because there is no contention:

while (currentLocation != -1 && !found) {

        currentEntry = table.pool[currentLocation];

        if (currentEntry.data.x == key) {

            found = true;

        } else {

            currentLocation = currentEntry.next;

        }

    }

Edit: you should also use atomicAdd() on globalFirstFree

if (currentLocation == -1) {

        int newLocation = atomicAdd(globalFirstFree,1);

        Entry newEntry;

        newEntry.data.x = key;

        newEntry.data.y = count;

        newEntry.data.z = sum;

        newEntry.data.w = sumSquared;

        newEntry.next = bucketHead;

...

you should probably be using volatile

Good point, that may explain the hangups I’ve encountered. I reworked the code to use while loops and to make sure all threads in the warp/block synchronize together. Code posted below; does it look good?

Yeah, what does that mean? I think there are some pretty awesome guys willing to help when they find the time. I try to do the same when I see a solution to a problem (however rare that is).

It isn’t so much about synchronizing “across” blocks as about the data being coherent between them. Also, all performance gains would be lost if I had to reassemble and aggregate the hash tables computed within the blocks.

Feel free to point out any more inefficiencies, I’m open to all suggestions!

Good call. Coming from older architectures, I’m not used to having to deal with the L1 and L2 caches. I figured the problem lay there but wasn’t sure how to solve it. Your fix worked, but as mentioned by tmurray, I should use volatile, which is what I did. I just had trouble getting volatile to work in previous attempts.

Yeah, as soon as you pointed that out, I started banging my head against a wall. Thanks for pointing out that obvious flaw.

You, sir, are correct. I had trouble getting it to compile the first couple of rounds through, but worked out all the casting I had to do elsewhere to get it working.

__device__ void lock(int *mutex) {

	while(atomicCAS(mutex, 0, 1) != 0);

}

__device__ void unlock(int *mutex) {

	atomicExch(mutex, 0);

}

__device__ void add_to_global_hash_table(unsigned int key, unsigned int count, unsigned int sum, unsigned int sumSquared, Table table, int *globalHashLocks, int *globalFirstFree)

{

	// Find entry if it exists

	unsigned int hashValue = hash(key, table.count);

	

	lock(&globalHashLocks[hashValue]);

	int bucketHead = table.entries[hashValue];

	int currentLocation = bucketHead;

	

	bool found = false;

	Entry currentEntry;

	while (currentLocation != -1 && !found) {

		currentEntry = table.pool[currentLocation];

		if (currentEntry.data.x == key) {

			found = true;

		} else {

			currentLocation = currentEntry.next;

		}

	}

	

	if (currentLocation == -1) {

		// If entry does not exist, create entry

		Entry newEntry;

		int newLocation = atomicAdd(globalFirstFree, 1);

		newEntry.data.x = key;

		newEntry.data.y = count;

		newEntry.data.z = sum;

		newEntry.data.w = sumSquared;

		newEntry.next = bucketHead;

		// Add entry to table

		table.pool[newLocation] = newEntry;

		table.entries[hashValue] = newLocation;

	} else {

		currentEntry.data.y += count;

		currentEntry.data.z += sum;

		currentEntry.data.w += sumSquared;

		table.pool[currentLocation] = currentEntry;

	}

	unlock(&globalHashLocks[hashValue]);

}

__device__ void add_to_shared_hash_table(unsigned int key, unsigned int value, volatile int *entries, Entry *pool, int *sharedHashLocks, int *sharedFirstFree)

{

	// Find entry in shared hash if it exists

	size_t sharedHashValue = hash(key, HASH_TABLE_SIZE);

	

	lock(&sharedHashLocks[sharedHashValue]);

	int bucketHead = entries[sharedHashValue];

	int currentLocation = bucketHead;

	

	bool found = false;

	Entry currentEntry;

	while (currentLocation != -1 && !found) {

		currentEntry = pool[currentLocation];

		if (currentEntry.data.x == key) {

			found = true;

		} else {

			currentLocation = currentEntry.next;

		}

	}

	if (currentLocation == -1) {

		// If entry does not exist, create entry

		Entry newEntry;

		int newLocation = atomicAdd(sharedFirstFree, 1);

		newEntry.data.x = key;

		newEntry.data.y = 1;

		newEntry.data.z = value;

		newEntry.data.w = (value * value);

		newEntry.next = bucketHead;

		// Add entry to table

		pool[newLocation] = newEntry;

		entries[sharedHashValue] = newLocation;

	} else {

		currentEntry.data.y++;

		currentEntry.data.z += value;

		currentEntry.data.w += (value * value);

		pool[currentLocation] = currentEntry;

	}

	unlock(&sharedHashLocks[sharedHashValue]);

}

__global__ void computeAggregation(unsigned int size, uint2 *recordset, Table table, int *globalHashLocks, int *globalFirstFree)

{

	__shared__ int sharedHashLocks[HASH_TABLE_SIZE];

	__shared__ volatile int entries[HASH_TABLE_SIZE];

	__shared__ Entry pool[BUCKETS_SIZE];

	__shared__ int sharedFirstFree;

	uint2 tuple;

	Entry sharedEntry;

	unsigned int threadIndex, blockIndex, warpThread;

	unsigned int threadID = (blockIdx.x * blockDim.x) + threadIdx.x;

	unsigned int stride = (gridDim.x * blockDim.x);

	if (threadIdx.x == 0) {

		sharedFirstFree = 0;

	}

	blockIndex = 0;

	while (blockIndex < HASH_TABLE_SIZE) {

		threadIndex = blockIndex + threadIdx.x;

		if (threadIndex < HASH_TABLE_SIZE) {

			sharedHashLocks[threadIndex] = 0;

			entries[threadIndex] = -1;

		}

		blockIndex += blockDim.x;

	}

	__syncthreads();

	blockIndex = blockIdx.x * blockDim.x;

	while (blockIndex < size) {

		threadIndex = blockIndex + threadIdx.x;

		if (threadIndex < size) {

			tuple = recordset[threadIndex];

			for (warpThread = 0; warpThread < warpSize; warpThread++)

				if ((threadIdx.x % warpSize) == warpThread)

					add_to_shared_hash_table(tuple.x, tuple.y, entries, pool, sharedHashLocks, &sharedFirstFree);

		}

		__syncthreads();

		blockIndex += stride;

	}

	__syncthreads();

	blockIndex = 0;

	while (blockIndex < BUCKETS_SIZE) {

		threadIndex = blockIndex + threadIdx.x;

		if (threadIndex < BUCKETS_SIZE) {

			sharedEntry = pool[threadIndex];

			for (warpThread = 0; warpThread < warpSize; warpThread++)

				if ((threadIdx.x % warpSize) == warpThread)

					add_to_global_hash_table(sharedEntry.data.x, sharedEntry.data.y, sharedEntry.data.z, sharedEntry.data.w, table, globalHashLocks, globalFirstFree);

		}

		__syncthreads();

		blockIndex += blockDim.x;

	}

}

__device__ __host__ unsigned int hash(unsigned int key, unsigned int count)

{

	return key % count;

}

Updated source attached.
src.zip (9.21 KB)