Global thread barrier

Progress! The atomicInc appears to have been the main bottleneck. I was able to eliminate it by having each block update a different counter, then wait until all the counters have been updated. My test program now runs in 32 seconds, compared to 35 seconds when using separate kernels and 38 seconds with the old version of syncAllThreads(). Considering that it spends more than half its time doing things unrelated to this algorithm, that’s a significant speedup.

Here is the new version:

[codebox]__device__ void syncAllThreads(short* syncCounter, short newCount)
{
    __syncthreads();
    // Each block announces its arrival by writing to its own counter slot (no atomics needed).
    if (threadIdx.x == 0)
        syncCounter[blockIdx.x] = newCount;
    // The first gridDim.x threads each monitor one block's counter.
    if (threadIdx.x < gridDim.x)
    {
        volatile short* counter = &syncCounter[threadIdx.x];
        do
        {
        } while (*counter != newCount);
    }
    __syncthreads();
}
[/codebox]

This version is a little more complicated to use than the previous version, so let me describe exactly what the requirements are. N is the number of thread blocks.

  1. N must be <= the number of SMs in your GPU.

  2. The number of threads/block must be >= N. (Each of the first N threads monitors one of the counters, so if there are fewer threads/block than there are blocks, some counters won’t be monitored.)

  3. syncCounter must be an array of length N.

  4. newCount can be any value so long as it is different from what is currently stored in the elements of syncCounter.

  5. If you call this multiple times in a single kernel, you must have at least two independent arrays of counters, and you must not use the same array twice in a row. Otherwise you could get a deadlock.

In my kernel I call it repeatedly in a loop, so I invoke it as

syncAllThreads(iteration%2 == 0 ? syncCounter1 : syncCounter2, iteration);

As long as there are at least four iterations, each counter array will end up holding a value different from the one used in the first iteration of the next call, so there’s no need to reset them before calling the kernel again.
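
To make the setup concrete, here is a rough sketch of how the whole thing might be wired up, host side and kernel side (the names numBlocks, threadsPerBlock, iterateKernel and so on are just placeholders, and the loop body is elided):

[codebox]// Rough sketch only: host-side setup plus a kernel skeleton following the
// rules above. numBlocks must not exceed the number of SMs, threadsPerBlock
// must be >= numBlocks, and the 0xFF fill just has to differ from the first
// newCount that will be passed in.
__global__ void iterateKernel(short* syncCounter1, short* syncCounter2, int numIterations)
{
    for (int iteration = 0; iteration < numIterations; iteration++)
    {
        // ... each thread updates its elements for this iteration ...
        syncAllThreads(iteration % 2 == 0 ? syncCounter1 : syncCounter2, (short)iteration);
    }
}

void launchIterations(int numBlocks, int threadsPerBlock, int numIterations)
{
    short *syncCounter1, *syncCounter2;
    cudaMalloc((void**)&syncCounter1, numBlocks * sizeof(short));
    cudaMalloc((void**)&syncCounter2, numBlocks * sizeof(short));
    cudaMemset(syncCounter1, 0xFF, numBlocks * sizeof(short));
    cudaMemset(syncCounter2, 0xFF, numBlocks * sizeof(short));
    iterateKernel<<<numBlocks, threadsPerBlock>>>(syncCounter1, syncCounter2, numIterations);
    cudaThreadSynchronize();   // wait for the kernel to finish
}
[/codebox]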

Peter

Neat optimization (iirc you don’t need the “do” block, but it doesn’t hurt). Any particular reason you’re using shorts? Also, condition (4) could be a bit problematic in libraries; perhaps you could roll a location id and an iteration counter into one int, and increment the iteration counter inside the method?
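
Something along these lines is what I have in mind (untested sketch, names made up): the barrier keeps its own iteration count in the counter array, so the caller never picks newCount and condition (4) disappears; and because the counts only ever increase and the wait uses “<”, a single counter array should be enough.

[codebox]__device__ void syncAllThreadsAuto(unsigned int* syncCounter)
{
    // syncCounter must be zero-initialized by the host before the first call
    __shared__ unsigned int target;             // iteration count this barrier waits for

    __syncthreads();
    if (threadIdx.x == 0)
    {
        target = syncCounter[blockIdx.x] + 1;   // next count for this block
        syncCounter[blockIdx.x] = target;       // publish our arrival
    }
    __syncthreads();                            // make target visible to the whole block

    if (threadIdx.x < gridDim.x)
    {
        volatile unsigned int* counter = &syncCounter[threadIdx.x];
        while (*counter < target)
            ;                                   // wait until every block reaches this count
    }
    __syncthreads();
}
[/codebox]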

Also, I’m curious what your benchmarks show if you actually use a lot of active blocks. N could be as large as 240 on the GTX 280, correct?

Overall I think global block sync is a bad idea though, for the reasons you’ve already mentioned. Would you mind explaining what you’re trying to use it for? There might be other solutions, which aren’t GPU-specific.

I’m curious how this idea would work:

  • create an array of tasks
  • each block grabs a task with no dependencies, and marks it as started
  • if all tasks have dependencies, the block must recompute a task. It should choose one that will enable a new task to be processed, if possible.
  • the block marks the tasks as finished and repeats the loop
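
Roughly something like this, maybe (untested sketch; all the names are made up, the “recompute a task” fallback is left out, and a real version would need volatile or atomic reads when polling unfinishedDeps):

[codebox]#define TASK_READY     0
#define TASK_STARTED   1
#define TASK_DONE      2
#define MAX_DEPENDENTS 8

struct Task
{
    int status;                      // TASK_READY / TASK_STARTED / TASK_DONE
    int unfinishedDeps;              // dependencies that are not yet TASK_DONE
    int numDependents;
    int dependents[MAX_DEPENDENTS];  // indices of tasks that depend on this one
};

// Called by one thread per block; returns the index of a claimed task, or -1.
__device__ int claimReadyTask(Task* tasks, int numTasks)
{
    for (int i = 0; i < numTasks; i++)
    {
        if (tasks[i].unfinishedDeps == 0 &&
            atomicCAS(&tasks[i].status, TASK_READY, TASK_STARTED) == TASK_READY)
            return i;                // this block now owns task i
    }
    return -1;                       // nothing ready right now
}

// Called by one thread per block once the block has written the task's output.
__device__ void finishTask(Task* tasks, int t)
{
    __threadfence();                 // make the task's output visible to other blocks
    tasks[t].status = TASK_DONE;
    for (int d = 0; d < tasks[t].numDependents; d++)
        atomicSub(&tasks[tasks[t].dependents[d]].unfinishedDeps, 1);
}
[/codebox]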

You definitely do need it. Without that, threads will just go ahead without waiting for other blocks to complete!

No particular reason, except that a short is more than large enough for this purpose.

No, see condition 1. On a GTX280, N can never be larger than 30.

Not for my purposes. I have an iterative algorithm that repeatedly updates a large matrix. Each element of the matrix in iteration i depends on many elements from iteration i-1 that don’t follow any regular pattern. So I have one thread compute each element, and then I need to do a global sync before starting the next iteration.

Peter

Interesting work, and a clever way of avoiding atomics.

One other thing I noticed: in your previous version, all threads were waiting for the counter to return to zero, which means that while waiting, the memory was being loaded by every warp, not just the first one. If a block has a lot of warps, this could increase the overall bandwidth use, which might make the atomics slower than they would be otherwise.

I’d be interested to know whether this is faster (though perhaps not faster than your atomic-free version):

__device__ void syncAllThreads(unsigned int* syncCounter)
{
    __syncthreads();
    if (threadIdx.x == 0)
    {
        // The last block to arrive wraps the counter back to zero, releasing everyone.
        atomicInc(syncCounter, gridDim.x - 1);
        volatile unsigned int* counter = syncCounter;
        do
        {
        } while (*counter > 0);
    }
    __syncthreads();
}

Sorry for the confusion, I just meant that syntactically “do {} while (condition);” == “while (condition) {}”, iirc.
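
In other words, these two spell the same busy-wait:

[codebox]do { } while (*counter != newCount);

while (*counter != newCount) { }
[/codebox]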

You have 30 SMs. Each SM could have 8 active blocks, right?

Nothing immediately comes to mind other than what I mentioned last time. How fast can you determine which elements are necessary for the next iteration? Is the per-element computation relatively intensive?

regards,

Nicholas

The problem with this barrier is that global memory is not sequentially consistent. That means that if a thread writes data to memory and then synchronizes using this barrier, other threads are not guaranteed to see the written data right after they pass the barrier. Instead, it may show up somewhat later. That’s why __threadfence() was introduced: it ensures that all written data is visible to all other running thread blocks. However, according to my observations, __threadfence() is very slow.
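
For what it’s worth, the place a fence would most naturally go is right before the block announces its arrival; roughly like this (untested sketch based on the syncAllThreads() posted above):

[codebox]__device__ void syncAllThreadsFenced(short* syncCounter, short newCount)
{
    __threadfence();   // flush this thread's pending global writes (CUDA 2.2+)...
    __syncthreads();   // ...and make sure every thread in the block has done so
    if (threadIdx.x == 0)
        syncCounter[blockIdx.x] = newCount;
    if (threadIdx.x < gridDim.x)
    {
        volatile short* counter = &syncCounter[threadIdx.x];
        while (*counter != newCount)
            ;          // spin until every block has checked in
    }
    __syncthreads();
}
[/codebox]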

If I understand correctly, your code seems to work right even without using __threadfence(). That’s interesting. I wonder how reliable it is, though, since nothing in the specs says it has to be.

I agree with you. Since a global barrier is feasible when sequential consistency is guaranteed, the Programming Guide would not have to say that “threads from different blocks must share data using two separate kernel invocations”. Perhaps, if writes from the same thread to global memory are not reordered by the hardware, there is a way to guarantee inter-block communication using something like a semaphore?

That was actually the first optimization I tried. It made no measurable difference to the performance. That was how I concluded that the loop wasn’t the bottleneck.

Peter

You’re correct, I’m relying on behavior of current hardware which is not guaranteed to work the same way in future hardware. Of course, we all do the same thing in other places too, such as assuming that groups of 32 sequential threads are automatically synchronized. When CUDA 2.2 comes out, I’ll try sticking in a __threadfence() and see if that makes it slower.

What would be even better, of course, is if CUDA would provide a global version of __syncthreads() so we wouldn’t have to implement it ourselves…

Peter

I don’t know if that limitation only applies to this code, but I implemented another version and I get 2 active blocks per SM. So I can have 30720 (60*512) threads. N can be up to 240, but it depends on your shared memory usage.

Yep, that is correct: as long as all of your blocks are actually running at once and no block has to wait for the rest to finish before it can be scheduled, you can even have up to 240 blocks. Has anybody tried using __threadfence() to make sure all global memory writes were flushed?

Cheers

I really wonder, does this still work now? Hopefully it is a good solution, since I have the same requirement as the author :)

One big problem that I see with those syncs is that they’ll bombard the memory controller, so any blocks that haven’t reached the sync point yet will run very slowly if they need to do any sort of memory access. You’d have to put an idle loop between each test of *counter, reducing the queries to only one every few thousand clock cycles, for it to work well.
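
Something like this, maybe (untested sketch; BACKOFF_CYCLES is just a guess at a reasonable value):

[codebox]#define BACKOFF_CYCLES 2000

// Idle between counter polls: spin on the on-chip clock register instead of
// hammering global memory, and only re-read the counter every few thousand cycles.
__device__ void waitForCount(volatile short* counter, short newCount)
{
    while (*counter != newCount)
    {
        unsigned int start = (unsigned int)clock();
        while ((unsigned int)clock() - start < BACKOFF_CYCLES)
            ;   // burn time without touching the memory controller
    }
}
[/codebox]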

There will be 500 clocks between each “bombardment” though, during which the thread isn’t eligible for execution.

Short of a regular sleep() function, I can’t see which other instruction would idle a thread more efficiently.

That’s an issue every 500 clocks for each thread. It takes around 10000 threads to saturate a GTX 280 with reasonable occupancy, so you end up with up to 20 memory requests per clock. Now, warp broadcast cuts this down to more like 1 memory request per memory clock. If we assume the controller uses the minimum transaction of 32 bytes, that gives us 256 bits of bandwidth each memory clock, plus memory controller overhead. That’s more than 50% of the device’s peak bandwidth right there! Worse, all these requests will fall into the same memory bank (a 64-bit bus each, I believe), meaning it would need 4x that bank’s maximum capacity to keep everything smooth. Hence, any memory requests from threads that aren’t finished yet that resolve to that bank will take 4 times longer on average to get through.

You’re right that this is probably more expensive than it could be if Nvidia implemented a fully supported “sync all threads” feature. I really hope they will do that in the future, because it’s absolutely essential for iterative algorithms. But even this hack is still many times faster than the alternatives. Those are 1) launch a kernel to do one iteration, wait for it to complete, and transfer the result back to the host to see whether it has converged yet, or 2) run the entire kernel on one SM, thus throwing away 29/30ths of your available processing power.

Peter

Out of curiosity - what algorithm are you implementing exactly?

This one:

Constant constraint matrix approximation

Peter

You’ll only need one memory access per block, and given the nature of the problem you might prefer only one block per MP. All warps on that MP, except for the first, will then be completely idle at the second sync point:

// run your kernel, all warps will end up at a first sync point:
do {
    __syncthreads();

    // then, as described in detail earlier in this thread
    if (threadIdx.x == 0) globalsync_hack();

    // most threads will skip the above and be idling here
    __syncthreads();
} while (loop--);

Caution: a global synchronisation barrier may deadlock even if you launch fewer blocks than the GPU can handle at once, due to a driver bug:
http://forums.nvidia.com/index.php?showtopic=150567