Dynamic Shared Memory Works Without Specifying It

I’m able to use dynamic shared memory without specifying the size in the kernel launch configuration. There are no compiler errors and no runtime errors; it just works. My expectation is that it would fail with something like a segfault when reading/writing shared memory from the kernel, but it doesn’t. Is there some compiler optimization going on here? I’m compiling with -arch=sm_80.

It’s hard to make any direct statements without an example to inspect. When I set up an experiment to do what you describe, compute-sanitizer reports a fault:

# cat t358.cu
__global__ void k(int *r){

  extern __shared__ int sd[];
  for (int i = 0; i < 32768; i++) sd[i] = i;
  int sum = 0;
  for (int i = 0; i < 32768; i++) sum += sd[i];
  *r = sum;
}

int main(){

  int *r;
  cudaMallocManaged(&r, sizeof(int));
  k<<<1,1>>>(r);
  cudaDeviceSynchronize();
}
# nvcc -o t358 t358.cu -arch=sm_89 -lineinfo
# compute-sanitizer ./t358
========= COMPUTE-SANITIZER
========= Invalid __shared__ write of size 16 bytes
=========     at 0x90 in /root/bobc/t358.cu:4:k(int *)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x410 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x32e950]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:libcudart_static_4d8b33a106dceb3c07a56e26de61f2d53bb62a68 [0x105de]
=========                in /root/bobc/./t358
=========     Host Frame:cudaLaunchKernel [0x707ee]
=========                in /root/bobc/./t358
=========     Host Frame:cudaError cudaLaunchKernel<char>(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) [0xaf5e]
=========                in /root/bobc/./t358
=========     Host Frame:__device_stub__Z1kPi(int*) [0xade8]
=========                in /root/bobc/./t358
=========     Host Frame:k(int*) [0xae1e]
=========                in /root/bobc/./t358
=========     Host Frame:main [0xac5c]
=========                in /root/bobc/./t358
=========     Host Frame:../sysdeps/nptl/libc_start_call_main.h:58:__libc_start_call_main [0x29d90]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:../csu/libc-start.c:379:__libc_start_main [0x29e40]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0xaac5]
=========                in /root/bobc/./t358
=========
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x47e786]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:cudaDeviceSynchronize [0x48704]
=========                in /root/bobc/./t358
=========     Host Frame:main [0xac61]
=========                in /root/bobc/./t358
=========     Host Frame:../sysdeps/nptl/libc_start_call_main.h:58:__libc_start_call_main [0x29d90]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:../csu/libc-start.c:379:__libc_start_main [0x29e40]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame:_start [0xaac5]
=========                in /root/bobc/./t358
=========
========= ERROR SUMMARY: 2 errors
#

As you can see, there are indeed no compile errors. That is expected.

A lack of a runtime error in your case could be because you are not properly checking for errors, or because the code you have written doesn’t actually perform shared memory accesses. I’m sure there are other possibilities.
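On the error-checking point: a kernel launch itself returns no status, so without explicit checks an "unspecified launch failure" like the one above goes unnoticed. A minimal sketch of the usual pattern (the checkCuda macro name here is my own choice, not a CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>

// Hypothetical helper macro: wraps any runtime API call and aborts
// with a descriptive message if it did not return cudaSuccess.
#define checkCuda(call)                                             \
  do {                                                              \
    cudaError_t err = (call);                                       \
    if (err != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
              cudaGetErrorString(err), __FILE__, __LINE__);         \
      exit(EXIT_FAILURE);                                           \
    }                                                               \
  } while (0)

// Usage after a kernel launch (which itself returns nothing):
//   kernel<<<grid, block>>>(...);
//   checkCuda(cudaGetLastError());       // catches launch-configuration errors
//   checkCuda(cudaDeviceSynchronize());  // catches asynchronous execution errors
```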

This is the example code taken from an NVIDIA blog that I was testing after coming across a similar thing in a larger code base. I was surprised that it ran and produced the correct results.

#include <stdio.h>

__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[64];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}

__global__ void dynamicReverse(int *d, int n)
{
  extern __shared__ int s[];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}

int main(void)
{
  const int n = 64;
  int a[n], r[n], d[n];

  for (int i = 0; i < n; i++) {
    a[i] = i;
    r[i] = n-i-1;
    d[i] = 0;
  }

  int *d_d;
  cudaMalloc(&d_d, n * sizeof(int));

  // run version with static shared memory
  // cudaMemcpy(d_d, a, n*sizeof(int), cudaMemcpyHostToDevice);
  // staticReverse<<<1,n>>>(d_d, n);
  // cudaMemcpy(d, d_d, n*sizeof(int), cudaMemcpyDeviceToHost);
  // for (int i = 0; i < n; i++)
  //   if (d[i] != r[i]) printf("Error: d[%d]!=r[%d] (%d, %d)\n", i, i, d[i], r[i]);

  // run dynamic shared memory version
  cudaMemcpy(d_d, a, n*sizeof(int), cudaMemcpyHostToDevice);
  //dynamicReverse<<<1,n,n*sizeof(int)>>>(d_d, n);
  dynamicReverse<<<1,n>>>(d_d, n);
  cudaMemcpy(d, d_d, n * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++) {
    if (d[i] != r[i]) printf("Error: d[%d]!=r[%d] (%d, %d)\n", i, i, d[i], r[i]);
    else printf("Success: d[%d]=r[%d] (%d, %d)\n", i, i, d[i], r[i]);
  }
}

If I change my test to 64 elements instead of 32768, it also reports no errors via compute-sanitizer. It starts to fail above 256 elements (1024 bytes of int data), which is more-or-less consistent with the compute-sanitizer report I previously posted.

So I would say that there appears to be some allocation of around 1024 bytes that is available whether explicitly requested or not. There might be reasons for this.

But without going down that avenue, I think it’s fair to say this is exploring undefined behavior (UB).
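For reference, the well-defined version of the launch passes the dynamic shared memory size as the third execution-configuration argument, which is what the commented-out line in the posted code originally did:

```cuda
// The third <<<...>>> argument is the dynamic shared memory size in bytes.
// With it specified, the extern __shared__ array is guaranteed to be at
// least that large, and the accesses in dynamicReverse are in bounds.
dynamicReverse<<<1, n, n * sizeof(int)>>>(d_d, n);
cudaDeviceSynchronize();
```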

Understood. I’m working with a contractor that had some CUDA unit tests set up on trivial data sizes, and I was wondering why their testing harness didn’t detect the error. They kept saying it was working, so I used this example to “show” them that it couldn’t work without errors, and it ran successfully too! lol. Thanks for your responses; I just wanted to do a sanity check and see if I’ve missed something obvious.

CUDA happens to reserve 1K bytes of dynamic shared memory per threadblock. This is documented. What isn’t documented is all the ramifications of that.
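The reserved amount can also be queried at runtime. A sketch, assuming CUDA 11.0 or newer, where the cudaDevAttrReservedSharedMemoryPerBlock device attribute is available:

```cuda
#include <cstdio>

int main() {
  int reserved = 0;
  // Shared memory reserved by the CUDA driver per block, in bytes
  // (cudaDevAttrReservedSharedMemoryPerBlock, introduced in CUDA 11.0).
  cudaDeviceGetAttribute(&reserved,
                         cudaDevAttrReservedSharedMemoryPerBlock, 0);
  printf("reserved shared memory per block: %d bytes\n", reserved);
  return 0;
}
```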

