__shared__ CUDA 9.0 programming Guide v. 0_Simple/matrixMul.cu

The online version of the CUDA 9.0 Programming Guide
says we must tell CUDA how much shared memory a kernel will use when it is launched (the default is 0 bytes).
However, the CUDA Samples matrix multiplication routine uses large amounts of shared
memory and does not specify this when the kernel matrixMulCUDA is launched using <<< >>>.

Has the compiler become smart enough to calculate how much shared memory will be used?
Is there something about the use of templates that affects shared memory that I have not spotted?

Any help or guidance you can give would be most welcome.

Thank you

PS: I am having problems with __shared__ in my own kernel and was hoping that
the samples might help…

There are two different allocation methods for shared memory.

  1. static allocation, where the size is known at compile time:
__shared__ int my_shared_data[COMPILE_TIME_CONSTANT];
  2. dynamic allocation, where the size can be specified at run time:
__shared__ extern int my_shared_data[];

For the second case, the kernel launch must explicitly supply one of the optional launch configuration parameters:

my_kernel<<<grid, block, size_of_shared_memory_in_bytes, ...>>>(...);
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ this 3rd optional parameter must be supplied
                                                        does not need to be compile-time constant

Note the syntax differences between the two shared memory declarations:

Using extern? It is dynamic.

Not using extern? It must be static.

It is possible to include both types of allocations in a single kernel, on a single launch.
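
To illustrate the last point, here is a minimal sketch of a kernel that uses both a static and a dynamic shared array in the same launch. The kernel name mixed_shared, the helper run_mixed_shared_demo, and the constant TILE are hypothetical names chosen for this example, and it assumes a single block of at most TILE threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256  // hypothetical compile-time constant for the static array

// Hypothetical kernel: uses one static and one dynamic shared array.
__global__ void mixed_shared(const int *in, int *out, int n)
{
    __shared__ int static_buf[TILE];       // static: size fixed at compile time
    extern __shared__ int dynamic_buf[];   // dynamic: size set by the 3rd launch parameter

    int t = threadIdx.x;
    if (t < n) {
        static_buf[t]  = in[t];
        dynamic_buf[t] = 2 * in[t];
    }
    __syncthreads();                       // make all shared writes visible to the block
    if (t < n) {
        // Read a different thread's static element plus this thread's dynamic one.
        out[t] = static_buf[n - 1 - t] + dynamic_buf[t];
    }
}

// Hypothetical host helper: returns true if every output element is as expected.
bool run_mixed_shared_demo()
{
    const int n = TILE;
    int h_in[TILE], h_out[TILE];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    // Only the dynamic array is counted here; the static array needs no launch argument.
    size_t dynamic_bytes = n * sizeof(int);
    mixed_shared<<<1, n, dynamic_bytes>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        // in[n-1-i] + 2*in[i] with in[i] == i
        if (h_out[i] != (n - 1 - i) + 2 * i) ok = false;
    }
    return ok;
}

int main()
{
    printf("%s\n", run_mixed_shared_demo() ? "ok" : "mismatch");
    return 0;
}
```

The third launch parameter covers only the extern array; the compiler accounts for the static array on its own. Error checking is left out of this sketch to keep it short.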

Dear Bob,
Thank you for your rapid reply.
I am using the first type, i.e. __shared__ int en[1024]; inside my kernel.
Can you confirm that I do not need to give the third argument (shared memory size) when
I launch the kernel using kernel<<<nblocks,block_size>>>(args…)?


Correct, you don’t need the 3rd kernel launch config argument for that case.

Thanks, Bob.
PS: it looks like my error will turn out to be a conventional index-out-of-bounds error.

Any time you are having trouble with a CUDA code, I recommend proper CUDA error checking, and running your code with cuda-memcheck.
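
As a rough sketch of what such error checking can look like, the example below wraps each runtime call in a macro and checks the kernel launch separately. The macro CUDA_CHECK, the kernel fill_en, and the helper run_checked_launch are hypothetical names for this example; the statically sized array en[1024] mirrors the declaration discussed above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with file and line if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel with a statically sized shared array.
__global__ void fill_en(int *out, int n)
{
    __shared__ int en[1024];
    int t = threadIdx.x;
    // Guard against both the shared array bound and the output bound.
    if (t < 1024 && t < n) {
        en[t]  = t;
        out[t] = en[t];
    }
}

// Hypothetical host helper: returns 0 if the launch completed without error.
int run_checked_launch()
{
    const int n = 1024;
    int *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(int)));

    fill_en<<<1, n>>>(d_out, n);           // static shared memory: no 3rd launch argument
    CUDA_CHECK(cudaGetLastError());        // catches invalid launch configurations
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel runs

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}

int main()
{
    return run_checked_launch();
}
```

The compiled program can then be run under the tool mentioned above, for example cuda-memcheck ./a.out, where a.out stands for whatever the executable is called.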