So I’ve been working on this problem for a while and still can’t seem to sort it out. Basically, the code works fine when only one thread block is used. However, if two or more are used, the data in global memory is “corrupt”. To explain further…
Essentially (in the attached code), I’m building a hash table in shared memory, and after I’m done populating it, building an aggregated hash table in global memory. When I’m using just one thread block, everything functions fine. However, when I use two or more, the data in the hash table is incorrect, or corrupt.
I’ve been reading online and trying to figure out what the potential issue is. From my understanding, one reason this could happen is if the locks or data writes are not propagated properly. So I tried using __threadfence(), and that did absolutely nothing.
I’m at my wits’ end and any help is appreciated! src.zip (9.45 KB)
Well, I tried __threadfence() and __threadfence_system(). However, neither seemed to help. Also, when there is only one block everything works fine, so why would the problem be inter-block communication?
__device__ void add_to_global_hash_table(unsigned int key, unsigned int count, unsigned int sum, unsigned int sumSquared, Table table, int *globalHashLocks, int *globalFreeLock, int *globalFirstFree)
{
    // Find entry if it exists
    unsigned int hashValue = hash(key, table.count);
    lock(&globalHashLocks[hashValue]);
    // __threadfence(); // not necessary
    int bucketHead = table.entries[hashValue];
    int currentLocation = bucketHead;
    bool found = false;
    Entry currentEntry;
    while (currentLocation != -1 && !found) {
        currentEntry = table.pool[currentLocation];
        if (currentEntry.data.x == key) {
            found = true;
        } else {
            currentLocation = currentEntry.next;
        }
    }
    if (currentLocation == -1) {
        // If entry does not exist, create entry
        lock(globalFreeLock);
        int newLocation = (*globalFirstFree)++;
        __threadfence(); // make sure the update to *globalFirstFree is visible to other threads before the lock is released
        unlock(globalFreeLock);
        Entry newEntry;
        newEntry.data.x = key;
        newEntry.data.y = count;
        newEntry.data.z = sum;
        newEntry.data.w = sumSquared;
        newEntry.next = bucketHead;
        // Add entry to table
        table.pool[newLocation] = newEntry;
        // __threadfence(); // not necessary
        table.entries[hashValue] = newLocation;
        // __threadfence(); // not necessary
    } else {
        currentEntry.data.y += count;
        currentEntry.data.z += sum;
        currentEntry.data.w += sumSquared;
        table.pool[currentLocation] = currentEntry;
        // __threadfence(); // not necessary
    }
    __threadfence(); // make sure the DRAM buffer is flushed,
    // i.e. the writes of this thread are seen by other threads before the unlock
    unlock(&globalHashLocks[hashValue]);
    // __threadfence(); // not necessary
}
I modified it to
lock(globalFreeLock);
int newLocation = (*globalFirstFree)++;
unlock(globalFreeLock);
__threadfence();
and removed the redundant __threadfence() calls.
As far as I know, a single __threadfence() can guarantee the ordering of writes before and after the critical section.
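To illustrate, here is a minimal sketch of the pattern I mean (illustrative names, not the actual kernel): the data is written inside the critical section, one __threadfence() orders that write, and only then is the lock released, so the next block to acquire the lock sees the update.
__device__ int g_lock;     // assumed: 0 = unlocked, 1 = locked
__device__ int g_value;    // assumed: data protected by g_lock

__device__ void publish_value(int v)
{
    while (atomicCAS(&g_lock, 0, 1) != 0);   // acquire
    g_value = v;                             // write inside the critical section
    __threadfence();                         // order the write before the release
    atomicExch(&g_lock, 0);                  // release; next lock owner sees g_value
}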
I gave it a try with the modifications you suggested. However, it still doesn’t seem to synchronize properly.
What baffles me is this. If I run the same code on a GTS250 (a compute 1.1 device, with all shared memory atomics removed), it works perfectly fine. The exact same code compiled for 2.0 and run on a C2050 or C2070 has a corrupt hash table. From everything I can figure, this is caused by the synchronization differences between the non-Fermi and Fermi architectures.
Does anyone have any more insight or suggestions? Is there any way to get actual help from NVIDIA developers as to what’s going on?
For example, I made the edits in the following file and compiled it for 1.1. If I run it on a compute 1.1 device, it works great (albeit slower). On a 2.0 device, the hash table is corrupt. Same number of blocks and same runtime environment; the only difference is the device it is run on. gpuDB_kernels.cu (4.72 KB)
Do you have any shared memory use in your kernel (sorry, but I am not downloading a zip archive to find out)? If you do, it could be a compiler optimization that is breaking something. Have a look at page 3 of the Fermi compatibility guide in the toolkit. You might need to declare shared memory as volatile inside a kernel to make sure that results are actually written to shared memory rather than kept in registers; keeping them in registers can break implicit intra-warp synchronization.
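For reference, a minimal sketch of what I mean, with made-up names (it assumes a single warp relying on implicit intra-warp synchronization):
__global__ void warpNeighbourDemo(const int *in, int *out)
{
    // Without volatile, an sm_20 build may keep these values in registers,
    // so the read by the neighbouring thread never sees the store.
    volatile __shared__ int scratch[32];

    scratch[threadIdx.x] = in[threadIdx.x];
    // Lockstep execution within a single warp; no __syncthreads() is used here.
    out[threadIdx.x] = scratch[threadIdx.x ^ 1];
}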
Yes, I do have shared memory usage in my kernel, however, all of it has been commented out in the version I posted that works on a GTS250 and does not on a C2050/C2070.
In trying to find a solution to my problem, I read the compatibility guide and the use of “volatile” to make sure shared memory writes are visible to all threads in the warp. However, utilizing volatile did nothing (perhaps I used it wrong), and since it works fine with 1 block (with multiple warps) and does not with 2 or more blocks, I don’t see how shared memory would be the problem.
I know it’s a hassle to download a zip and try running it, but if you wouldn’t mind, I would really appreciate it. At this point, I’m willing to pay for support if someone can help me figure out what the issue is.
Sorry, I forgot to mention that in order to run it on the GTS 250 in compute 1.1 mode, I had to make the following changes:
gpuDB.cu: line 9
#define BLOCK_SIZE 512
to
#define BLOCK_SIZE 256
and commented out the shared memory code:
__global__ void computeAggregation(unsigned int size, uint2 *idata, Table table, int *globalHashLocks, int *globalFreeLock, int *globalFirstFree)
{
    /*
    __shared__ int sharedHashLocks[HASH_ENTRIES];
    __shared__ int sharedFreeLock;
    __shared__ int entries[HASH_ENTRIES];
    __shared__ Entry pool[HASH_SIZE];
    __shared__ int sharedFirstFree;
    */
    uint2 tuple;
    Entry sharedEntry;
    unsigned int index, warpThread;
    unsigned int threadID = (blockIdx.x * blockDim.x) + threadIdx.x;
    unsigned int stride = (gridDim.x * blockDim.x);
    /*
    if (threadIdx.x == 0) {
        sharedFreeLock = 0;
        sharedFirstFree = 0;
    }
    __syncthreads();
    for (index = threadIdx.x; index < HASH_ENTRIES; index += blockDim.x) {
        sharedHashLocks[index] = 0;
        entries[index] = -1;
    }
    __syncthreads();
    */
    for (index = threadID; index < size; index += stride) {
        tuple = idata[index];
        for (warpThread = 0; warpThread < warpSize; warpThread++) {
            if ((threadIdx.x % warpSize) == warpThread) {
                add_to_global_hash_table(tuple.x, 1, tuple.y, (tuple.y * tuple.y), table, globalHashLocks, globalFreeLock, globalFirstFree);
                //add_to_shared_hash_table(tuple.x, tuple.y, sharedHashLocks, &sharedFreeLock, entries, pool, &sharedFirstFree);
            }
        }
        __syncthreads();
    }
    /*
    __syncthreads();
    for (index = threadIdx.x; index < HASH_SIZE; index += blockDim.x) {
        sharedEntry = pool[index];
        for (warpThread = 0; warpThread < warpSize; warpThread++) {
            if ((threadIdx.x % warpSize) == warpThread) {
                add_to_global_hash_table(sharedEntry.data.x, sharedEntry.data.y, sharedEntry.data.z, sharedEntry.data.w, table, globalHashLocks, globalFreeLock, globalFirstFree);
            }
        }
        __syncthreads();
    }
    */
}
When I compile with nvcc -Xptxas -v -arch=sm_11 gpuDB_kernels.cu, I get:
ptxas info : Compiling entry function '_Z10testKerneljP5uint2S0_' for 'sm_11'
ptxas info : Used 4 registers, 24+16 bytes smem
ptxas info : Compiling entry function '_Z18computeAggregationjP5uint25TablePiS2_S2_' for 'sm_11'
ptxas info : Used 23 registers, 64+16 bytes smem, 16 bytes cmem[1]
Bump! I’m trying to find a solution to this problem, does anyone know of any ways to contact CUDA developers at NVIDIA? Or any avenues of official support?
As size is not a multiple of the block size, different threads will encounter different numbers of __syncthreads() calls in the loops. Behavior in that case is undefined, and usually bad things happen.
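A minimal sketch of the safe pattern (illustrative, not the original kernel): keep the loop bounds uniform across the block and guard only the body, so every thread of the block reaches the same __syncthreads() calls.
__global__ void processAll(unsigned int size, const uint2 *idata)
{
    unsigned int stride = gridDim.x * blockDim.x;
    // base is uniform within the block, so all of its threads iterate together
    for (unsigned int base = blockIdx.x * blockDim.x; base < size; base += stride) {
        unsigned int index = base + threadIdx.x;
        if (index < size) {
            // ... per-element work goes here ...
        }
        __syncthreads();   // reached by every thread of the block on every iteration
    }
}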
Good point, that may explain the hangs I’ve encountered. I reworked the code to use while loops and made sure all threads in the warp/block synchronize together. Code is posted below; does it look good?
Yeah, what does that mean? I think there are some pretty awesome guys willing to help when they find the time. I try to do the same when I see a solution to a problem (however rare that is).
It isn’t so much about synchronizing “across” blocks as it is about keeping the data coherent between them. Also, all performance gains would be lost if I had to reassemble and aggregate the hash tables computed within the blocks.
Feel free to point out any more inefficiencies, I’m open to all suggestions!
Good call. Coming to Fermi from the older architectures, I’m not used to having to deal with the L1 and L2 caches. I figured the problem lay there but wasn’t sure how to solve it. Your fix worked, but as tmurray mentioned, I should use volatile, which is what I did. I just had trouble getting volatile to work in previous attempts.
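For anyone hitting this later, a minimal sketch of the reader-side problem as I understand it (illustrative names; the assumption is that Fermi’s per-SM L1 cache is not coherent across blocks):
__device__ int wait_for_publish(volatile const int *flag)
{
    // Without volatile, loads may be served from this SM's L1 cache and keep
    // returning a stale value even after another block has updated the flag.
    while (*flag == -1)
        ;                  // spin until the writer's store becomes visible
    return *flag;
}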
Yeah, as soon as you pointed that out, I started banging my head against a wall. Thanks for pointing out that obvious flaw.
You, sir, are correct. I had trouble getting it to compile the first couple of tries, but figured out all the casting I had to do elsewhere to get it to work.
__device__ void lock(int *mutex) {
    while (atomicCAS(mutex, 0, 1) != 0);
}

__device__ void unlock(int *mutex) {
    atomicExch(mutex, 0);
}

__device__ void add_to_global_hash_table(unsigned int key, unsigned int count, unsigned int sum, unsigned int sumSquared, Table table, int *globalHashLocks, int *globalFirstFree)
{
    // Find entry if it exists
    unsigned int hashValue = hash(key, table.count);
    lock(&globalHashLocks[hashValue]);
    int bucketHead = table.entries[hashValue];
    int currentLocation = bucketHead;
    bool found = false;
    Entry currentEntry;
    while (currentLocation != -1 && !found) {
        currentEntry = table.pool[currentLocation];
        if (currentEntry.data.x == key) {
            found = true;
        } else {
            currentLocation = currentEntry.next;
        }
    }
    if (currentLocation == -1) {
        // If entry does not exist, create entry
        Entry newEntry;
        int newLocation = atomicAdd(globalFirstFree, 1);
        newEntry.data.x = key;
        newEntry.data.y = count;
        newEntry.data.z = sum;
        newEntry.data.w = sumSquared;
        newEntry.next = bucketHead;
        // Add entry to table
        table.pool[newLocation] = newEntry;
        table.entries[hashValue] = newLocation;
    } else {
        currentEntry.data.y += count;
        currentEntry.data.z += sum;
        currentEntry.data.w += sumSquared;
        table.pool[currentLocation] = currentEntry;
    }
    unlock(&globalHashLocks[hashValue]);
}
__device__ void add_to_shared_hash_table(unsigned int key, unsigned int value, volatile int *entries, Entry *pool, int *sharedHashLocks, int *sharedFirstFree)
{
    // Find entry in shared hash if it exists
    size_t sharedHashValue = hash(key, HASH_TABLE_SIZE);
    lock(&sharedHashLocks[sharedHashValue]);
    int bucketHead = entries[sharedHashValue];
    int currentLocation = bucketHead;
    bool found = false;
    Entry currentEntry;
    while (currentLocation != -1 && !found) {
        currentEntry = pool[currentLocation];
        if (currentEntry.data.x == key) {
            found = true;
        } else {
            currentLocation = currentEntry.next;
        }
    }
    if (currentLocation == -1) {
        // If entry does not exist, create entry
        Entry newEntry;
        int newLocation = atomicAdd(sharedFirstFree, 1);
        newEntry.data.x = key;
        newEntry.data.y = 1;
        newEntry.data.z = value;
        newEntry.data.w = (value * value);
        newEntry.next = bucketHead;
        // Add entry to table
        pool[newLocation] = newEntry;
        entries[sharedHashValue] = newLocation;
    } else {
        currentEntry.data.y++;
        currentEntry.data.z += value;
        currentEntry.data.w += (value * value);
        pool[currentLocation] = currentEntry;
    }
    unlock(&sharedHashLocks[sharedHashValue]);
}
__global__ void computeAggregation(unsigned int size, uint2 *recordset, Table table, int *globalHashLocks, int *globalFirstFree)
{
    __shared__ int sharedHashLocks[HASH_TABLE_SIZE];
    __shared__ volatile int entries[HASH_TABLE_SIZE];
    __shared__ Entry pool[BUCKETS_SIZE];
    __shared__ int sharedFirstFree;
    uint2 tuple;
    Entry sharedEntry;
    unsigned int threadIndex, blockIndex, warpThread;
    unsigned int threadID = (blockIdx.x * blockDim.x) + threadIdx.x;
    unsigned int stride = (gridDim.x * blockDim.x);
    if (threadIdx.x == 0) {
        sharedFirstFree = 0;
    }
    blockIndex = 0;
    while (blockIndex < HASH_TABLE_SIZE) {
        threadIndex = blockIndex + threadIdx.x;
        if (threadIndex < HASH_TABLE_SIZE) {
            sharedHashLocks[threadIndex] = 0;
            entries[threadIndex] = -1;
        }
        blockIndex += blockDim.x;
    }
    __syncthreads();
    blockIndex = blockIdx.x * blockDim.x;
    while (blockIndex < size) {
        threadIndex = blockIndex + threadIdx.x;
        if (threadIndex < size) {
            tuple = recordset[threadIndex];
            for (warpThread = 0; warpThread < warpSize; warpThread++)
                if ((threadIdx.x % warpSize) == warpThread)
                    add_to_shared_hash_table(tuple.x, tuple.y, entries, pool, sharedHashLocks, &sharedFirstFree);
        }
        __syncthreads();
        blockIndex += stride;
    }
    __syncthreads();
    blockIndex = 0;
    while (blockIndex < BUCKETS_SIZE) {
        threadIndex = blockIndex + threadIdx.x;
        if (threadIndex < BUCKETS_SIZE) {
            sharedEntry = pool[threadIndex];
            for (warpThread = 0; warpThread < warpSize; warpThread++)
                if ((threadIdx.x % warpSize) == warpThread)
                    add_to_global_hash_table(sharedEntry.data.x, sharedEntry.data.y, sharedEntry.data.z, sharedEntry.data.w, table, globalHashLocks, globalFirstFree);
        }
        __syncthreads();
        blockIndex += blockDim.x;
    }
}
__device__ __host__ unsigned int hash(unsigned int key, unsigned int count)
{
    return key % count;
}