Basic question about shared memory usage and device functions

Hi, I recently ran into an unintuitive behavior when using shared memory in device functions that I’m hoping to better understand.

To illustrate, say I have a kernel that performs a calculation in several steps. A simple example might look something like:

template < typename input_type, typename output_type >
__global__ void kernel(input_type * inputs, output_type * outputs) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  auto intermediate1 = foo(inputs[i]);
  auto intermediate2 = bar(intermediate1);
  outputs[i]         = qux(intermediate2);
}

where each of those functions (foo, bar, qux) performs one phase of the calculation and uses shared memory internally to access and mutate data from the other threads in the block. Using foo as an example:

template < typename T >
__device__ auto foo(T input) {
  __shared__ T buffer[64];     // shared memory declared in device function scope
  buffer[threadIdx.x] = input; // put things in shared memory
  ...                          // do the calculations in shared memory
  auto answer = ...;           // eventually get the result
  return answer;
}

But when I compiled the kernel with -Xptxas=-v, I was surprised to see that the total shared memory usage reported is the sum of the shared memory usage of foo, bar, and qux. I expected that, since the shared memory goes out of scope when returning from each of those device functions, the high-water mark would be the max of the shared memory used by foo, bar, and qux, rather than the sum.
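For anyone who wants to reproduce this, here is a minimal self-contained sketch (the float element type, the 64-element buffers, and the trivial per-phase calculations are placeholders, and block-level synchronization is omitted for brevity, just as in the pseudocode above):

// three device functions, each declaring its own 256-byte shared buffer
__device__ float foo(float input) {
  __shared__ float foo_buf[64];
  foo_buf[threadIdx.x] = input;
  return foo_buf[(threadIdx.x + 1) % 64];   // read a neighbor's element
}

__device__ float bar(float input) {
  __shared__ float bar_buf[64];
  bar_buf[threadIdx.x] = input * 2.0f;
  return bar_buf[(threadIdx.x + 1) % 64];
}

__device__ float qux(float input) {
  __shared__ float qux_buf[64];
  qux_buf[threadIdx.x] = input + 1.0f;
  return qux_buf[(threadIdx.x + 1) % 64];
}

__global__ void repro_kernel(const float * inputs, float * outputs) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  outputs[i] = qux(bar(foo(inputs[i])));
}

Compiling this with nvcc -c -Xptxas=-v, the resource usage line for repro_kernel lists the static shared memory as the sum of the three buffers (roughly 768 bytes here) rather than the 256-byte maximum.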

A simple workaround is to go back and change the implementation to something like:

template < typename input_type, typename output_type >
__global__ void kernel(input_type * inputs, output_type * outputs) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;

  // forward-declare the shared memory to be used
  // in the different phases of calculations
  __shared__ union {
    foo_cache_type foo;
    bar_cache_type bar;
    qux_cache_type qux;
  } cache;

  auto intermediate1 = foo(inputs[i],     cache.foo);
  auto intermediate2 = bar(intermediate1, cache.bar);
  outputs[i]         = qux(intermediate2, cache.qux);
}

where the union explicitly ensures that the total shared memory usage is the max (rather than the sum) across the steps, but this design is clumsier:

  • it requires refactoring the functions to accept new arguments (see the sketch after this list)
  • you need a way to resolve those cache types (foo_cache_type, bar_cache_type, etc.) and to forward-declare them
  • using a union to do this is sort of a dubious C++ practice
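
For concreteness, here is a sketch of what that refactoring might look like for foo (the cache struct and the float element type are placeholders, not the actual project's types):

// hypothetical cache type matching the foo member of the union above
struct foo_cache_type {
  float buffer[64];
};

__device__ float foo(float input, foo_cache_type & cache) {
  cache.buffer[threadIdx.x] = input;                     // storage is now provided by the caller
  __syncthreads();                                       // make every thread's write visible
  float answer = cache.buffer[(threadIdx.x + 1) % 64];   // placeholder for the real calculation
  __syncthreads();                                       // all threads finish before the next phase reuses the union
  return answer;
}

bar and qux would be refactored the same way against bar_cache_type and qux_cache_type.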

Does anyone know of a better way to work around this issue?

I’m also curious why the shared memory usage is the sum (rather than the max) of the shared memory used in mutually exclusive scopes. Is this something that could be improved in nvcc? If not, there’s probably a good reason it behaves this way, and I’d like to learn more about what that reason is.

For reference, this was tested on Ubuntu 20.04 with CUDA 11.4.

Use a statically allocated shared memory buffer declared outside of your functions (at kernel scope) and pass a pointer to that shared memory from the kernel to your functions.

Similar: use dynamically allocated shared memory (at kernel scope) and pass the pointer to your functions.

You can certainly eliminate the union this way.
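
As a sketch of the statically allocated variant (the element type, the 64-thread block size, and the trivial per-phase calculations are all placeholders):

// each phase works in caller-provided shared memory instead of declaring its own
__device__ float foo(float input, float * buf) {
  buf[threadIdx.x] = input;
  __syncthreads();                              // make every thread's write visible
  float answer = buf[(threadIdx.x + 1) % 64];   // placeholder calculation
  __syncthreads();                              // everyone is done with buf before the next phase reuses it
  return answer;
}

__device__ float bar(float input, float * buf) {
  buf[threadIdx.x] = input * 2.0f;
  __syncthreads();
  float answer = buf[(threadIdx.x + 63) % 64];
  __syncthreads();
  return answer;
}

__device__ float qux(float input, float * buf) {
  buf[threadIdx.x] = input + 1.0f;
  __syncthreads();
  return buf[threadIdx.x];
}

__global__ void kernel(const float * inputs, float * outputs) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;

  // one block-scope allocation, sized for the largest phase; all three phases reuse it
  __shared__ float scratch[64];

  auto intermediate1 = foo(inputs[i], scratch);
  auto intermediate2 = bar(intermediate1, scratch);
  outputs[i]         = qux(intermediate2, scratch);
}

With this layout, -Xptxas=-v should report only the single 256-byte buffer. For the dynamically allocated variant, the declaration becomes extern __shared__ float scratch[]; and the buffer size is supplied as the third kernel launch configuration parameter.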

An implication/requirement for this is that your functions either don’t actually share data between threads and have harmonized memory layout/usage per thread (probably not typical) or else your functions are completely synchronized from each other, so as to avoid concurrency problems. But that is pretty much implicit in your suggestion to use the union method.

Independent (static) shared memory allocations must be aggregated, because in the general case the compiler has no knowledge whether they will be used concurrently, or not. (Without any evident synchronization or explicit usage pattern, the code you have shown is subject to this hazard. The CUDA threads in a threadblock do not all execute in lockstep.)
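
To make the hazard concrete, here is a hypothetical sketch (not your code): if foo and bar were made to share one buffer but no barrier separated the phases, a thread that finishes foo early can start bar and overwrite an element that a slower thread is still reading inside foo:

__device__ float foo(float input, float * buf) {
  buf[threadIdx.x] = input;
  __syncthreads();
  return buf[(threadIdx.x + 1) % 64];   // reads a neighbor's element
}

__device__ float bar(float input, float * buf) {
  buf[threadIdx.x] = input * 2.0f;      // may clobber an element a slower thread still needs inside foo
  __syncthreads();
  return buf[(threadIdx.x + 1) % 64];
}

__global__ void hazard_kernel(const float * in, float * out) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  __shared__ float buf[64];             // manually overlapped storage
  out[i] = bar(foo(in[i], buf), buf);   // no __syncthreads() between the phases: data race
}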

Thank you for the insight!

An implication/requirement for this is that your functions either don’t actually share data between threads and have harmonized memory layout/usage per thread (probably not typical) or else your functions are completely synchronized from each other, so as to avoid concurrency problems. But that is pretty much implicit in your suggestion to use the union method.

Good point: I neglected to mention (or show in the pseudocode) that the actual project synchronizes between phases. The example device function implementation should have looked like this:

template < typename T >
__device__ auto foo(T input) {
  __shared__ T buffer[64];     // shared memory declared in device function scope
  buffer[threadIdx.x] = input; // put things in shared memory
  ...                          // do the calculations in shared memory
  auto answer = ...;           // eventually get the result
  __syncthreads();
  return answer;
}

so that every thread in a block is in agreement about which of foo, bar, or qux is being evaluated at a given point in time.

Independent (static) shared memory allocations must be aggregated, because in the general case the compiler has no knowledge whether they will be used concurrently, or not. (Without any evident synchronization or explicit usage pattern, the code you have shown is subject to this hazard. The CUDA threads in a threadblock do not all execute in lockstep.)

This is the point I was missing: the mutual exclusivity of the shared memory in this example (which is what would allow the allocations to be overlapped) comes from the combination of:

  1. disjoint scopes
  2. block-wide synchronization between them

I wonder if that conclusion is true in general. If so, it would be cool if nvcc could recognize when a kernel contains a sequence of synchronized, disjoint scopes such as

{
  __shared__ char A[m];
  // ...
}
__syncthreads();
{
  __shared__ char B[n];
  // ...
}

and deduce that A and B could be overlapped.
