Global Barriers and State Explosion

This might be a bit off topic, but I think that people here would probably appreciate it more than others.

So if anyone has ever wondered why there is no global barrier across CTAs in CUDA, you might want to take a look at this: http://www.gdiamos.net/papers/stateExplosion.pdf

I would be interested in what people think of the proposed idea in this short paper. Would a destructive global barrier be a useful programming construct for CUDA? What other suggestions might people have for dealing with this problem?

You’re off by a factor of 65535 in your calculation for CUDA’s maximum state: you can have 2D grids of 65535x65535.

I need to think about your proposal more before I can really respond to it, but I guess this makes for an interesting conversation…

edit: Maybe it would be better if you could clarify some of the terms you use, just so we’re on the same page. First, when you say persistent shared memory, is this one globally visible shared memory region? Second, is the destructive global barrier just a CUDA kernel boundary?

Thanks for the correction. I guess I had never tried a larger grid and assumed the limit was 65536, since that is the largest I had tried.

Yes, persistent shared memory would be like global memory, but possibly smaller and of fixed size, like shared memory, so that it could fit on chip.

The barrier would be similar to a kernel boundary, but you would declare it in the kernel body (maybe as __globalSync()) along with some additional state (maybe something like persistent int array[100];) that would be modifiable by all threads in all CTAs. You could make it so that no variable computed before the __globalSync() could be referenced after it, and shared memory would be wiped out when you called __globalSync().

edit: The advantage would be that you would not have to go back to the host to launch the next kernel.
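To make the construct above concrete, here is a rough sketch of what a kernel using it might look like. To be clear, this is entirely hypothetical: neither __globalSync() nor the persistent qualifier exists in CUDA, and the two-phase max reduction is just for illustration.

// Hypothetical syntax: neither 'persistent' nor __globalSync() exists
// in CUDA today; this only sketches the proposed constructs.
persistent int blockValue[100]; // fixed-size state visible to all CTAs
                                // that survives the global barrier

__global__ void twoPhaseMax( const int* in, int* out, int n )
{
    // Phase 1: each CTA deposits one value into the persistent state.
    // (Launch with at most 100 CTAs; a real kernel would reduce within
    // the CTA first.)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( threadIdx.x == 0 )
        blockValue[blockIdx.x] = ( i < n ) ? in[i] : 0;

    // Destructive global barrier: every CTA in the grid waits here.
    // Shared memory and all values computed above are invalidated;
    // only persistent state (and global memory) survives.
    __globalSync();

    // Phase 2: continues without a round trip to the host to launch
    // a second kernel.
    if( blockIdx.x == 0 && threadIdx.x == 0 )
    {
        int m = blockValue[0];
        for( int b = 1; b < gridDim.x; ++b )
            m = max( m, blockValue[b] );
        *out = m;
    }
}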

A few short comments:

    It’s “i.e.” not “ie.” and I think you meant “e.g.” (for example) instead of “i.e.” (that is)

    65536x65536x512 = 2^41. This is much more convenient to read than 2199023255552.

    Section 3: abaility -> ability

    Section 3: “In this section, I discuss several alternatives” -> Alternatives to what? And where do you discuss them? I don’t see much discussion in this section.

    Section 5: compuation -> computation

It seems like state explosion is a real problem, but I don’t think you are properly introducing it. Instead, you are presenting a bit of a false dilemma as your motivation. If all threads must be able to communicate with each other, then the same amount of memory is required to maintain state regardless of the number of processors involved. Thus, highly parallel architectures present no inherent advantage in this respect over serial architectures.

Next, what is the persistent shared memory you’re talking about? This seems to be a crucial piece of the puzzle, but you gloss over it. Is this saving the per-block shared memory? If yes, then this seems like a workable solution for now, but it merely delays the state explosion rather than making it manageable. If no, then it’s hard to see how your solution offers any real advantage over sequential kernel launches, so the difference should be made clear.

Thanks for the comments. As an aside, this is not meant to be anything really formal, just a collection of thoughts, so please excuse any grammatical errors; there are likely to be many.

It is true that the same amount of memory is required to maintain state regardless of the number of processors involved.

What I meant to say was that a future architecture will have more memory and processor resources due to Moore’s law scaling. If a program written today is intended to run on current-generation hardware as well as future hardware without being rewritten, it must be able to express a large amount of parallelism without requiring more state than a modern architecture can materialize.

A parallel architecture will not have any advantage over a serial architecture in terms of state requirements. I only mean that a future architecture will probably have significantly more memory resources than a modern architecture, regardless of whether it is parallel or serial, and that parallel programming models typically require more state to execute more work in parallel. Additionally, I believe that future architectures will continue to improve parallelism without much progress in serial performance. State explosion refers to the idea that programmers today will be prevented from expressing enough parallelism to scale on future architectures due to the memory limitations of current architectures.

The idea is to have globally shared memory that is persistent across global barriers. This memory would be fixed size and accessible by all threads in all blocks similar to global memory, but with potentially faster performance. For example, an iterative algorithm could continuously update a single shared variable that could be used to decide whether or not to launch another iteration or to finish the kernel.
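For contrast, here is roughly how that iterative pattern has to be written today: the flag lives in ordinary global memory and the relaunch decision is made on the host between kernel launches. The update rule and all of the names here are placeholders, not code from the paper.

#include <cuda_runtime.h>

__global__ void iterate( float* data, int n, int* notConverged )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i >= n ) return;
    float updated = 0.5f * data[i];     // stand-in for the real update
    if( fabsf( updated - data[i] ) > 1e-6f )
        atomicExch( notConverged, 1 );  // any thread can veto convergence
    data[i] = updated;
}

void solve( float* d_data, int n, int* d_flag )
{
    int h_flag = 1;
    while( h_flag )
    {
        h_flag = 0;
        cudaMemcpy( d_flag, &h_flag, sizeof(int), cudaMemcpyHostToDevice );
        iterate<<< ( n + 255 ) / 256, 256 >>>( d_data, n, d_flag );
        // The kernel boundary is the global barrier; this round trip
        // through the host is the overhead a device-side barrier with
        // persistent state would remove.
        cudaMemcpy( &h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost );
    }
}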

This may or may not be a good idea. I would be interested to see what other ideas people here have for enabling global synchronization without requiring an excessive amount of state.

Can you elaborate on this a bit more? If the size of memory is fixed, then it’s hard for me to see how this is useful in the kind of producer/consumer example you use for illustration. In this scenario, N threads would have to recreate their state from global memory plus some memory of size S where S << N. Also, how would these threads update the single shared variable, as this would (seemingly) require a global atomic operation, which equates to global synchronization?

It seems to me that you are actually suggesting a lower latency mechanism for re-launching kernels from the device rather than a global synchronization method.

I agree that it does seem less than satisfactory to have a fixed amount of memory for global synchronization. However, it could still be useful for several common parallel operations. For example, any commutative operation, such as the MPI_Allreduce family of operations, could be implemented using a fixed amount of state and atomic operations. Similarly, a broadcast of a single value from one controller thread to all others could be accomplished using the same mechanism.
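As a minimal sketch of the sum flavor, here is how such an allreduce can already be written with today’s primitives; the only extra state is a single global accumulator updated with one atomic per CTA. This assumes 256-thread blocks, *result zeroed before launch, and hardware that supports float atomics.

// Sum allreduce with O(1) extra state: addition is commutative, so a
// single global accumulator plus atomics is enough.
__global__ void allreduceSum( const float* in, float* result, int n )
{
    __shared__ float partial[256];      // launch with 256 threads per CTA
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    partial[tid] = ( i < n ) ? in[i] : 0.0f;
    __syncthreads();

    // Standard tree reduction within the CTA.
    for( int s = blockDim.x / 2; s > 0; s >>= 1 )
    {
        if( tid < s )
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // One atomic per CTA into the fixed-size state. After a global
    // barrier (today, the kernel boundary), *result holds the full sum
    // and is visible to every thread: the broadcast half of the allreduce.
    if( tid == 0 )
        atomicAdd( result, partial[0] );
}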

It seems to me that it might be useful for people to start thinking now about what implications this kind of restriction would have on the design of new applications. Simply needing to select algorithms that can easily re-create their state after a global synchronization could expose a class of algorithms that are more easily scalable than others.

More or less yes. Re-launching kernels is a means of doing global synchronization. It currently has some unnecessary performance overheads (high latency of going to the CPU and then back to the GPU to launch a new kernel) that this approach could help with, but yes, I don’t see my solution being a dramatic improvement over what we have now. I am more concerned about making people aware of the problem than proposing a perfect solution.

After some more thought, it seems like a useful option may be to have a way to launch kernels such that the number of blocks will be automatically set to match the number of SMs. In such cases, global synchronization should be possible. This would require a slightly different way of writing kernels in that they would probably have to loop a number of times and perform the actual computation in the loop, but it seems a small price to pay for having synchronization available.
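A rough sketch of what I mean, using only the existing runtime API (device 0 and the per-element work are placeholders): the kernel loops over “virtual” CTAs, and the host sizes the grid to the SM count so that every CTA is resident at once.

#include <cuda_runtime.h>

// Each launched CTA loops over the virtual CTAs assigned to it instead
// of the runtime oversubscribing the SMs. Assumes data has
// numVirtualCtas * blockDim.x elements.
__global__ void loopingKernel( float* data, int numVirtualCtas )
{
    for( int v = blockIdx.x; v < numVirtualCtas; v += gridDim.x )
    {
        int i = v * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;  // stand-in for the per-CTA work
        // Because all CTAs are resident, a software global barrier
        // between iterations becomes possible here (not shown; doing
        // it correctly needs atomics and memory fences).
    }
}

void launch( float* d_data, int numVirtualCtas )
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );
    // One CTA per SM guarantees co-residency, assuming the kernel's
    // register and shared memory usage allow it.
    loopingKernel<<< prop.multiProcessorCount, 256 >>>( d_data,
                                                        numVirtualCtas );
}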

This would be one way to do it, but it would require the programmer to manually serialize units of work onto the number of SMs. One of the nice things about the current model is that the compiler or runtime can do this for the programmer. The fact that CTAs cannot communicate via anything other than atomics (which are commutative) means that they can be executed in any order or automatically serialized. For example, it is possible to execute a program with a loop over the number of CTAs, as in:

for( i in all CTAs )
  execute_cta( i );

Executing the same program on a multi-core machine can be done by breaking the CTAs into groups and statically assigning each group to a different processor, as in:

for( g in all groups )
  spawn_thread( g );

spawn_thread( g )
  for( j in all CTAs in group g )
    execute_cta( j );

You still have to worry about state explosion in these cases though.

for( i in all CTAs )
  execute_cta_before_barrier( i );
  // any variables that are live across the barrier must be saved here

for( i in all CTAs )
  // any variables that are live across the barrier must be loaded here
  execute_cta_after_barrier( i );

My point is that it makes sense to force the programmer to actively deal with the amount of state that must be kept live across a global barrier. Requiring the programmer to deal with the number of SMs, as you suggest, would be one way of doing this. However, it seems to me that having the compiler or runtime start with the maximum amount of parallelism (threads and thread blocks) and then have the freedom to fuse them together and map them onto SMs would be more desirable, as it would reduce program complexity. There have been several works, including some of ours, showing that it is relatively easy to convert CUDA threads into loops in the compiler.
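As a hypothetical illustration of that transformation (not the actual algorithm from any of those papers), a kernel with no __syncthreads() can be serialized almost mechanically; the CTA and thread indices simply become loop variables:

// Original kernel: one logical thread per element.
__global__ void scaleKernel( float* data, float a, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n )
        data[i] *= a;
}

// What a compiler or runtime can generate for a serial target. A real
// translator must also split the loops at each __syncthreads() and
// save any values that are live across it: the same state question
// again, just at a smaller scale.
void scaleSerial( float* data, float a, int n, int gridDimX, int blockDimX )
{
    for( int b = 0; b < gridDimX; ++b )       // loop over CTAs
        for( int t = 0; t < blockDimX; ++t )  // loop over threads
        {
            int i = b * blockDimX + t;
            if( i < n )
                data[i] *= a;
        }
}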