Is NVCC smart enough to compute thread independent quantities once and broadcast?

Does anyone know if NVCC will detect threadIdx-invariant code and compute it once into shared memory, then have all threads read it from there?
If so, what limitations are there (e.g. does it only work with 1-dimensional thread blocks)?

This is unlikely. Such an optimization could be a speed win, but it’s complex enough that you could do it by hand while the compiler could never safely infer it.

At least three reasons it’d be complex for any compiler to recognize and implement this:

  • An added (compiler-inserted) __syncthreads() is needed to make sure the value has been computed (by some elected thread) before the other warps read it.
  • Some thread needs to do the actual computation, and it’s very hard to guarantee that the thread you want to use is actually available. You can’t just assume thread 0 can do the work, for example (what if it has diverged?), so the compiler would need to statically prove there is no divergence at that point… or add a runtime election check, which is even more awkward.
  • The result lives in shared memory, but the lifetime of that temporary shared allocation is much harder for a compiler to judge than a register’s.

So this kind of optimization could be very useful when organizing code by hand, but it’s just too tricky for compilers to recognize and implement safely and efficiently.
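To make the three points above concrete, here is a hand-written sketch of the pattern the compiler would have to synthesize. expensive_fn and the kernel’s signature are hypothetical stand-ins for any threadIdx-invariant computation; note how each of the three difficulties shows up as an explicit line of code:

```cuda
// Placeholder for some costly threadIdx-invariant computation.
__device__ float expensive_fn(float p) { return sqrtf(p) * 0.5f; }

__global__ void scale(const float *in, float *out, float param)
{
    __shared__ float invariant;           // temporary shared storage (reason 3)

    if (threadIdx.x == 0)                 // elect one thread (reason 2) -
        invariant = expensive_fn(param);  // only safe if we know thread 0
                                          // has not diverged at this point
    __syncthreads();                      // barrier before any use (reason 1)

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * invariant;
}
```

Written by hand, you know whether thread 0 is live and whether the barrier is legal; a compiler would have to prove both.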


Just imagine the horror if the compiler did this to a loop variable:

for (unsigned int i = 0; i < n; i++)

	...

electing one thread to do i++, writing it to a shared variable, then syncing on every iteration! That would blow away pretty much all thread-level parallelism (TLP) unless the loop body was really long.

Communication via shared memory and __syncthreads() is expensive - the few times I’ve tried this “optimization” by hand, it has actually been faster just to recompute the values in each thread and avoid the communication overhead. Of course, there must be a break-even point at some level of complexity of the once-computed value; I just don’t have a good idea of where that is.
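For comparison, the recompute-in-every-thread alternative is trivial - a sketch using the same placeholder computation as above:

```cuda
// Placeholder for some costly threadIdx-invariant computation.
__device__ float expensive_fn(float p) { return sqrtf(p) * 0.5f; }

__global__ void scale(const float *in, float *out, float param)
{
    // Every thread redundantly computes the block-invariant value.
    // Redundant arithmetic, but no shared memory, no thread election,
    // and no barrier - often the faster choice for cheap expressions.
    float invariant = expensive_fn(param);

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * invariant;
}
```

The redundant version costs extra ALU work per thread but keeps every warp independent, which is why it tends to win until the invariant computation gets genuinely expensive.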
