Having a little trouble with mutex/synchronisation

I’m attempting to write a solution to calculate the histogram of a set of data. This may not be the most efficient way to implement it, but at the moment my algorithm is to have a thread examine an element of data and increment the correct bin of the histogram accordingly. So for N data elements I have N threads - typically across a number of blocks, since the thread limit per block is 512 and we usually have far more elements of data than that.

Since synchronisation between thread blocks isn’t possible, I’m getting around that for now by having a histogram for each thread block, which at the end will be added to each other to produce the single final histogram. So each thread block writes to a different set of memory locations, and each block’s threads synchronise access to those locations.

Or that's the theory, at least; in practice it looks like there are some mutual exclusion issues.

Here’s a snippet of the relevant kernel code:

for (int j = 0; j < num_bins; j++)
{
   start = j*bin_size;
   end = start + (bin_size-1);
   if (colour >= start && colour <= end)
   {
      g_odata[baseindex+j]++;
   }
}

__syncthreads();

g_odata is the output array; baseindex is a variable calculated earlier to point to the start of the histogram for the current block, based on the block ID.

As I said, it does not produce the correct results. I believe it is a synchronisation issue because when I build and run in emuDebug mode I get the correct results (memory access is sequential in emulation mode, I believe, which is what makes me think it's a mutex problem).

If I try to put __syncthreads(); immediately after the write to g_odata in the if block, the program locks up and I have to do a hard reset of my machine. In emulation mode, no errors are thrown in this case.

Anyone have any ideas? I'm using version 0.8 on WinXP with the latest beta drivers from nVidia and an 8800 GTS.

Thanks!

Hi Peter, can you provide more info about your specific application and its requirements? That might help us provide better pointers.

We have a histogram sample in progress for a future release of the SDK. Unfortunately it’s not quite ready for release, but it should be soon!

In this code, more than one thread can write to the same location in g_odata at the same time. The results of this are undefined. In practice, only one of them will succeed.

Yes, this is illegal. A __syncthreads() invokes a barrier instruction. No thread may proceed past the barrier until all threads of the block have reached it. Therefore if you put a barrier inside a branch that one or more threads do NOT take, you will hang the GPU. (In the future we will have better handling of this error.) If the GPU is your primary display device, currently the result is a system hang.
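For example, something like this (a quick illustrative sketch, not your code) will deadlock if the barrier is placed inside the branch, but is fine with the barrier outside it:

__global__ void barrierExample(unsigned int* g_odata)
{
    if (threadIdx.x < 16)
    {
        g_odata[threadIdx.x]++;
        // __syncthreads();   // WRONG: threads with threadIdx.x >= 16 never reach
        //                    // this barrier, so the block hangs
    }

    __syncthreads();          // OK: every thread in the block reaches this barrier
}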

There is one interesting way of "simulating" concurrent writes that one of our engineers came up with. It uses a "tag and test" methodology, and relies on the fact that only one warp (32 threads on G80) runs at a time:

// one histogram array per 32-thread warp:
__shared__ unsigned int histogram[NHIST][NBIN];

unsigned short bin = pixelValue();  // load a pixel from memory

// do this until update is not overwritten:
do
{
    u32 myVal = histogram[myHist][bin] & 0x7FFFFFF;  // read the current bin val
    myVal = ((tid & 0x1F) << 27) | (myVal + 1);      // tag my updated val
    histogram[myHist][bin] = myVal;                  // attempt to write the bin
} while (histogram[myHist][bin] != myVal);           // while updates overwritten

The requirement here is that maximum counts in the histogram bins are small enough that you have some extra most significant bits at the top of each bin where a tag can be written to indicate the last thread that has updated that bin (the example code uses 27-bit bins). So when a thread wants to write to the histogram, it reads the bin value, increments it, and then appends its thread ID to the top of the new value before writing it. It then immediately reads the value again and if its tag is still there it knows the write succeeded. If not, it tries again.
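Concretely, with the constants above, each 32-bit bin value is split into two fields: bits 31-27 hold the tag, which is just the writing thread's 5-bit lane within its warp (tid & 0x1F), and bits 26-0 hold the actual count (0x7FFFFFF is the mask for those low 27 bits).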

If threads update random bins (not true in most applications, actually), the average number of iterations for a warp of threads to update the histogram should be <= 3.

Also note that we have to have a histogram per warp, not per thread block. After computing per-warp histograms, we have to reduce across warps to get a block histogram, and then across blocks to get the global histogram.

I haven’t tested the code above, so use it at your own risk. It’s tricky/hacky, but it’s kind of neat. Enjoy!

Mark

Thanks so much for the comprehensive reply!

Basically this is a 'first step' in a larger process of shot boundary detection for video (well, actually, in a number of different processes we might try to apply the GPU to). So a frame of video is decoded (resolution: 352x240) and its histogram is computed. It is an RGB frame, so there are 352 x 240 x 3 = 253,440 elements to be examined. They are 8-bit values. The number of bins we use is usually 32.

Very interesting! I’ll look forward to it. I’ve only seen modest gains with occlusion query techniques using Cg/OpenGL, so histograms was one of the first things I wanted to try on CUDA to see if there’s any improvement.

Ah, thank you.

The 'hack' you've provided is interesting and looks pretty clever, and I'll see if I can give it a try, but I'm not sure if I'd meet all the constraints (for example, bin updates would be fairly random given the data, I'd imagine). Is there any other way to provide mutual exclusion? I had understood previously that using __syncthreads() after a write would do so, but I guess the advice I was getting there was incorrect. Mutual exclusion would seem like a fairly necessary facility when dealing with such a highly parallel system(?)

Thank you so much again for your help! :)

Even if writes aren’t completely random, one of the threads in the warp will succeed at each iteration through that loop. Therefore the maximum number of iterations is W, where W is the warp size. It’s expensive in the worst case, but it’s really the only way to simulate atomic operations on current hardware (which lacks intrinsic atomics). This worst case is equivalent to serializing accesses to the histogram array.

Another way, if you have a small enough number of bins to fit in shared memory, is to keep a separate histogram per thread, then reduce the histograms across all threads to create the thread block histogram. This reduction alone will cost B * N log N, where N is the thread block size and B is the number of bins.
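As a rough, untested sketch of that alternative (assuming 256-thread 1D blocks, 32 bins of unsigned short, and input values in 0-255, so everything just fits in 16KB of shared memory):

#define NBINS 32

__global__ void perThreadHistogram(unsigned int* g_idata, unsigned int* g_odata)
{
    const unsigned int tid = threadIdx.x;
    const unsigned int N   = blockDim.x;     // assumed to be a power of two

    // N private histograms, one per thread, laid out bin-major in shared memory;
    // launch with NBINS * N * sizeof(unsigned short) bytes of dynamic shared memory
    extern __shared__ unsigned short hist[];

    for (int b = 0; b < NBINS; b++)
        hist[b*N + tid] = 0;
    __syncthreads();

    // each thread bins its own element into its own column - no contention at all
    unsigned int colour = g_idata[blockIdx.x*N + tid];
    unsigned int bin = colour / (256 / NBINS);
    hist[bin*N + tid]++;
    __syncthreads();

    // tree reduction across threads, bin by bin
    for (unsigned int stride = N/2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            for (int b = 0; b < NBINS; b++)
                hist[b*N + tid] += hist[b*N + tid + stride];
        __syncthreads();
    }

    // threads 0..NBINS-1 write out the block histogram
    if (tid < NBINS)
        g_odata[blockIdx.x*NBINS + tid] = hist[tid*N];
}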

If B is 32, N is 256 (say you are doing a 16x16 block), and W is 32, then the former technique will take O(W * N), while the latter will take O(B * N * log N).

The reduction technique is slower by a factor of log N, so I would choose the flag and test technique. The flag and test method also only keeps a histogram per warp, while the reduction method keeps one per thread. If histogram bins are larger than a short int each, then you won’t be able to keep 256 32-bin histograms in a 16KB shared memory. With only one per warp, you can fit multiple thread blocks per multiprocessor, which is preferable.
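(Working out the shared memory numbers: 256 threads x 32 bins x 2 bytes per short is exactly 16 KB, the whole of a multiprocessor's shared memory, while 8 per-warp histograms of 32 unsigned ints are only 1 KB per 256-thread block.)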

Mark

Ah OK, I get you now - previously I thought you were saying writes shouldn't be random for better performance. My writes should be totally random, so hopefully this will be a decent fit…

Thanks. I’ll try implementing the flag and test method and see how it goes. I may be back with more questions on an implementation level! But either way I’ll let you know how I get on.

Thanks so much again, really appreciated!

edit - actually, here’s a first question - how do I know which warp the current thread is in? Or in other words, how do I know which histogram I should be writing in to?

Thread IDs 0-31 are Warp 0?

32-63 are Warp 1?

And so on?

Exactly. For 1D thread blocks, the warp ID would be:

unsigned short warpID = (threadIdx.x >> 5);
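For a 2D thread block you would flatten the thread index first, e.g.:

unsigned int flatTid = threadIdx.y * blockDim.x + threadIdx.x;
unsigned short warpID = flatTid >> 5;   // 32 threads per warp on G80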

Mark

Cool, thank you.

I actually realised how horribly inefficient the first histogram implementation I provided above is, in terms of calculating which bin to assign each value to. How embarrassing! But I'm doing something a little better now…

Anyway, I have tried to implement this flag and test technique, but it looks like it’s not working fully. I made just a couple of small changes:

unsigned int myVal = 0;

do
{
   myVal = g_odata[warpID*32 + bin] & 0x7FFFFFF;  // read the current bin val
   myVal = ((tid & 0x1F) << 27) | (myVal + 1);    // tag my updated val
   g_odata[warpID*32 + bin] = myVal;              // attempt to write the bin
} while (g_odata[warpID*32 + bin] != myVal);      // while updates overwritten

I changed myVal to an unsigned int because my compiler was not recognising u32, and took it outside of the do/while because it was complaining that myVal was undefined otherwise. It’s probably highlighting my own ignorance that I ‘fixed’ those problems this way…would these changes muck it up? I also index into a 1D output array, so I had to change my indexing. The start of a warp’s histogram should be at warpID*32 (in the case I’m testing with, there’ll be 32 histograms, i.e. checking 1024 elements).

At the moment it looks like about a quarter of the threads manage to get in and update. Again, in emulation mode, I get the expected result.

Are you sure the values you are writing do not clobber the tags? They must be <= 27 bits to ensure you don’t overwrite the tag with the “|” (OR).

Also, you say you have 32 warps, but the maximum thread block size on G80 is 512 threads, so that’s not possible.

Mark

Well, I've had the interesting discovery that only the first quarter of the histograms are being written to. In debug mode all the threads make their write, but all to the first 8 histograms - the last 24 are untouched. So on the GPU itself, usually the first 8 histograms are written to, but only one thread's write makes it through. This seems to suggest a problem with indexing, but I'm not sure…(?)

I think I should have 32 warps - I have 1024 elements, so I’m spawning 1024 threads - using 4 thread blocks of 256 threads each.

Is it the case that warpIDs are not global? That for each block the first warp in that block starts at ID 0? That might explain this behaviour I’m seeing. In that case, I would need to use the blockID somehow in my indexing?

To update - I have changed my indexing to assume that warp IDs start again from 0 within each thread block, rather than running continuously across all n threads in total. Now on the GPU itself the same histogram locations are written to as in emulation mode, across all 32 histograms. However, whereas multiple writes are made to these locations in emulation mode, only one write makes it on the GPU - so it still seems there is a mutex issue.

The values I’m writing in this test to each bin typically don’t exceed 2 or 3, so I don’t think the tags should be clobbered. Unless there is a subtlety to the use of u32 versus unsigned int, here, for myVal?

Thanks again for your time!

Yes, that’s correct, but also, each block gets its own shared memory – so you can think of them as independent wrt shared memory. But when addressing global memory you should take the blockID into account.

Mark

Thanks, that makes sense. As per my last post, I’ve modified my code to take this into account, since I am currently writing out to global memory.

(On a side note, would it be faster to create a set of histograms for the current block in shared memory and write to those? The histograms would have to be copied back out to main memory, but I guess I’m wondering if this happens with every thread invocation, or only once when all threads in the block have finished? If it’s the latter I guess that should be faster than writing out to global memory all the time).

Back on topic, I’m still not entirely sure why only one write per histogram bin is getting through. Here’s my code as it is now:

// what warp is this in the current block?
unsigned short warpID = (tid >> 5);

// the index of the first bin of the histogram for this warp in the output array
unsigned int baseIndex = 8*bid*32 + warpID*32;

// do this until update is not overwritten:
unsigned int myVal = 0;
do
{
   myVal = g_odata[baseIndex + bin] & 0x7FFFFFF;  // read the current bin val
   myVal = ((tid & 0x1F) << 27) | (myVal + 1);    // tag my updated val
   g_odata[baseIndex + bin] = myVal;              // attempt to write the bin
} while (g_odata[baseIndex + bin] != myVal);      // while updates overwritten

How are you assigning “tid”?

Yes, you should definitely do this in shared memory. If you don’t use shared memory for reduction-type operations, you are losing the benefits of CUDA.

Mark

Like so:

// access thread id
const unsigned int tid = threadIdx.x;

Thanks, interesting. I’m a little confused by the order of events though…if I run this kernel to calculate n histograms which will be subsequently reduced to one, don’t I have to write those histograms out to global memory so that they can be passed to a subsequent kernel for reduction? Or are you saying I should roll the reduction into this kernel somehow?

edit - actually I think I see how this would work now. After the bin incrementation I should __syncthreads() and then write the shared memory histograms out to global memory (all threads in the block should be finished by then). I suppose you could also reduce the block's histograms to one at that point too. I was confused before because I wasn't sure how you'd specify that the histograms only be written out to global memory once, when everything was finished.

Yes, reduce the warp histograms to one block histogram before writing out. Then another kernel would reduce the block histograms to a global histogram.
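Roughly like this, for example (untested, assuming your layout of 8 warps x 32 bins per block and the variable names from your kernel; d_final and numBlocks are just placeholder names):

// at the end of the per-warp phase, after all warp histograms are complete:
__syncthreads();

// reduce the 8 warp histograms to one block histogram: one thread per bin
if (tid < 32)
{
    unsigned int sum = 0;
    for (int w = 0; w < 8; w++)
        sum += histos[w*32 + tid] & 0x7FFFFFF;   // mask off the tag bits before summing
    g_odata[bid*32 + tid] = sum;                 // one 32-bin histogram per block
}

// second kernel, launched as sumBlockHistograms<<< 1, 32 >>>(d_odata, d_final, numBlocks),
// reduces the per-block histograms to the final global histogram:
__global__ void sumBlockHistograms(unsigned int* blockHists, unsigned int* finalHist,
                                   unsigned int numBlocks)
{
    unsigned int bin = threadIdx.x;              // 32 threads, one per bin
    unsigned int sum = 0;
    for (unsigned int b = 0; b < numBlocks; b++)
        sum += blockHists[b*32 + bin];
    finalHist[bin] = sum;
}

Note, by the way, that after the do/while the top bits of each bin still hold whichever tag was written last, so they need to be masked off (as above) before the counts are summed or written out.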

I think I described this early in the thread. Sorry if I was cryptic. :)

The key is to do as much as possible in shared memory – it is as fast as registers, while global device memory has >200 cycle latency.

I’m not sure why your code is not working, I’ll have another look…

Mark

I'm not the sharpest tool in the shed on occasion, but I think I finally get what you're saying now ;) I've changed my code to use shared memory (I think), though I haven't yet implemented the reduction. Here's the entire kernel as it is at the moment:

__global__ void
testKernel(unsigned int* g_idata, unsigned int* g_odata)
{
  // access thread id
  const unsigned int tid = threadIdx.x;
  // access block id
  const unsigned int bid = blockIdx.x;
  // num threads
  const unsigned int num_threads = blockDim.x;

  // current block's histograms - 8*32
  extern __shared__ unsigned int histos[];
  histos[tid] = 0;

  __syncthreads();

  // read in the pixel value
  float colour = g_idata[(bid*num_threads) + tid];

  // which bin does it belong to? 256 values / 32 bins = 8
  int bin = colour/8;

  // what warp is this in the current block?
  unsigned short warpID = (tid/32);

  // the index of the first bin of the histogram for this warp
  unsigned int baseIndex = 32*warpID;

  // do this until update is not overwritten:
  unsigned int myVal = 0;
  // g_odata replaced by histos
  do
  {
     myVal = histos[baseIndex + bin] & 0x7FFFFFF;  // read the current bin val
     myVal = ((tid & 0x1F) << 27) | (myVal + 1);   // tag my updated val
     histos[baseIndex + bin] = myVal;              // attempt to write the bin
  }
  while (histos[baseIndex + bin] != myVal);        // while updates overwritten

  __syncthreads();

  g_odata[32*8*bid + tid] = histos[tid];
}

I invoke the kernel with this:

// setup execution parameters
dim3 grid(4, 1, 1);
dim3 threads(256, 1, 1);

// execute the kernel
testKernel<<< grid, threads, 256*sizeof(unsigned int) >>>(d_idata, d_odata);

The input data is just a small set of test data - 1024 ints between 0 and 255.
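For reference, the host-side setup is nothing special - roughly along these lines (h_idata is just a host array of the test values; error checking omitted):

unsigned int numElements   = 1024;        // 4 blocks * 256 threads
unsigned int numOutputBins = 4 * 8 * 32;  // 8 warp histograms of 32 bins per block

unsigned int* d_idata;
unsigned int* d_odata;
cudaMalloc((void**)&d_idata, numElements * sizeof(unsigned int));
cudaMalloc((void**)&d_odata, numOutputBins * sizeof(unsigned int));

cudaMemcpy(d_idata, h_idata, numElements * sizeof(unsigned int), cudaMemcpyHostToDevice);
cudaMemset(d_odata, 0, numOutputBins * sizeof(unsigned int));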

Thanks so much again for your great help!

I may not be fully understanding how threads execute or are scheduled here, but I worked through what I think is a possible interleaving of thread instructions (taking an example of just two threads running concurrently) whereby the end result in a histogram bin could be just 1 even with two increments, using the proposed mutex construct. I think the three lines of code in the do/while loop would really need to be atomic for this to work.

If all the threads are executing those instructions at the exact same time, then I think it is possible they will all have the same myVal, and all break out of the do/while after one iteration (leaving the result as just 1 for that bin).

If I'm not making sense, I'll try to concretise. Say we have two threads, thread 0 and thread 1. Is this interleaving possible?

Thread 0 reads the value of the bin into myVal - myVal is 0.
Thread 1 reads the value of the bin into myVal - myVal is 0.
Thread 0 increments myVal - myVal is 1.
Thread 1 increments myVal - myVal is 1.
Thread 0 writes myVal back to the bin - bin is 1.
Thread 1 writes myVal back to the bin - bin is 1.
Thread 0 tests if the bin value is equal to myVal, and breaks out of the loop since they are equal.
Thread 1 tests if the bin value is equal to myVal, and breaks out of the loop since they are equal.

… so despite two attempted increments by two different threads, the end result in the bin is just 1. The same problem arises with 3, 4, 5, 6… threads, more so if they're all executing exactly concurrently.

Or is this not the case?

edit - oops, I see it’s not the case. The values won’t be the same since of course, we’re also tagging the bins with unique ids. My bad! Sorry about this, my understanding seems to evolve as I post…

To update, I thought the problem might be that we were reserving only 5 bits for the thread ID tags - which would be fine for 32 threads in a block, but my blocks have 256. So I changed the code to reserve 8 bits for the tags, but the results are exactly the same as before.

Here’s the new do/while loop with 8-bit tags (I think it’s correct, but there may be a mistake…my bit manipulation is a little rusty!):

do
{
   myVal = histos[baseIndex + bin] & 0xFFFFFF;   // read the current bin val (low 24 bits)
   myVal = ((tid & 0xFF) << 24) | (myVal + 1);   // tag my updated val with an 8-bit thread ID
   histos[baseIndex + bin] = myVal;              // attempt to write the bin
} while (histos[baseIndex + bin] != myVal);      // while updates overwritten

Hi Peter,

As far as I know this histogram update technique only operates between the threads of a warp (32 threads).

Following one warp's execution of the read in the while condition (the last line), the thread scheduler could switch to another warp from the same block. That other warp might then resume from an earlier point by executing the bin write on the second-last line. After its next step (its own while-condition read), two threads from different warps are both satisfied with their increments, while the value itself has only increased by 1, not 2.

Cheers,
Paul

Yes, but each warp has its own histogram, so that shouldn’t matter, right?

Mark