Shared memory to Global memory data transfer

Hello,

As a test case, I've attached the code below.

Makefile (810 Bytes)
m_gpu.f90 (1.2 KB)
main.f90 (1.2 KB)

I run 5600 (=NUMEL) kernels (defined as INTEGRAL in the code) in parallel. The test was to write 5.0D0 into the first element of a shared memory array and then copy the value at that position of the shared array into a global array (lines 42-47 in m_gpu.f90). Yet it does not work. What am I doing wrong? Can you help?

It is supposed to give the result “TEST_ARRAY 5.0D0 5.0D0 5.0D0 5.0D0 5.0D0” yet it gives “TEST_ARRAY 0.0D0 0.0D0 0.0D0 0.0D0 0.0D0”.

Thanks,
Y. Altıntop

Hi Y. Altıntop,

A few issues here.

Adding error checking after your call, you'll see that the kernel is failing with an "invalid argument" error because the shared memory size is too big. You've set it to the 48KB max, but the kernel also has a fixed-size shared array of 32 elements, which pushes the total over the max.
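For reference, that error check is just a couple of lines right after the launch. A minimal sketch in CUDA Fortran (ISTAT is an illustrative variable name; cudaGetLastError and cudaGetErrorString come from the cudafor module):

```fortran
! Check for a launch error immediately after the kernel call
ISTAT = cudaGetLastError()
IF (ISTAT /= cudaSuccess) THEN
   WRITE(*,*) 'Kernel launch failed: ', cudaGetErrorString(ISTAT)
END IF
```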

Since ARRAY has a fixed size, you're not actually using the dynamic shared memory. To use it, declare ARRAY as an automatic or assumed-size array, i.e.:

DOUBLE PRECISION, SHARED :: ARRAY(blockDim%x)
or
DOUBLE PRECISION, SHARED :: ARRAY(*)

Given how the code is now, i.e. one element of ARRAY per thread, the size you want to use would be:

SIZEINBYTES = BLOCK_X*8

Of course, if you update the code so the size of ARRAY changes, just be sure the launch configuration matches the size of the automatic array times the data-type size in bytes.
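Putting the pieces together, a sketch of the fix (INTEGRAL, BLOCK_X, GRID_X, TEST_ARRAY_D, and NUMEL stand in for the names in your code; this is illustrative, not a drop-in replacement):

```fortran
! Kernel with a dynamically sized (assumed-size) shared array.
! In a real build this lives in a module compiled with "use cudafor".
ATTRIBUTES(GLOBAL) SUBROUTINE INTEGRAL(TEST_ARRAY, N)
   DOUBLE PRECISION, DEVICE :: TEST_ARRAY(N)
   INTEGER, VALUE :: N
   DOUBLE PRECISION, SHARED :: ARRAY(*)   ! size comes from the launch config
   ARRAY(threadIdx%x) = 5.0D0
   CALL syncthreads()
   IF (threadIdx%x == 1) TEST_ARRAY(blockIdx%x) = ARRAY(1)
END SUBROUTINE INTEGRAL

! Host side: request one double per thread of dynamic shared memory
SIZEINBYTES = BLOCK_X * 8
CALL INTEGRAL<<<GRID_X, BLOCK_X, SIZEINBYTES>>>(TEST_ARRAY_D, NUMEL)
```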

-Mat


Hi Mat,

Thank you for the answer.

What I understood is that each SM has 64KB that can be shared between shared memory and L1 cache.
At a time, only one kernel can be launched on each SM. Since I allocated 48KB of shared memory for each SM, I thought that (32*8/1024 =) 0.25KB for each kernel would easily fit in the reserved shared memory.

I would like to ask you the following.
I have 16 SMs and 48KB of shared memory for each. Assuming the SMs end up running the same number of kernels, (5600/16 =) 350: is it going to try to allocate (350*0.25KB =) 87.5KB of shared memory instead of writing over the same 0.25KB when I launch it statically? How does the fixed-size array of 32 in each kernel cause me to go over the max?

Thanks,
YA

The default shared memory size per block is 48KB, which you specify as the dynamic shared memory size in the kernel launch configuration. But when the additional 256 bytes per block from the static shared memory array is added on top, you end up going over the max and thus get the launch error.

At a time, only one kernel can be launched on each SM. Since I allocated 48KB of shared memory for each SM, I thought that (32*8/1024 =) 0.25KB for each kernel would easily fit in the reserved shared memory.

You do not need to set the dynamic shared memory size if your kernel uses static fixed size shared memory arrays. You only need dynamic shared memory when using dynamically sized automatic arrays.

The total shared memory usage is the combined total of both the dynamic shared memory (as set at runtime via the CUDA launch configuration) and the static shared memory (set at compile time via fixed sized shared arrays).
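In other words (a sketch, not your actual code), a kernel can use both kinds at once, and they add up:

```fortran
ATTRIBUTES(GLOBAL) SUBROUTINE BOTH_KINDS()
   DOUBLE PRECISION, SHARED :: FIXED(32)  ! static: 32*8 = 256 bytes, set at compile time
   DOUBLE PRECISION, SHARED :: DYN(*)     ! dynamic: sized by the launch configuration
   FIXED(1) = 1.0D0
   DYN(1)   = 2.0D0
END SUBROUTINE BOTH_KINDS

! Per-block total = 256 bytes (static) + the bytes passed in <<<grid,block,bytes>>>.
! Passing the full 48KB as "bytes" on top of the 256 static bytes exceeds the
! 48KB default limit, hence the "invalid argument" launch error.
```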

What I understood is that each SM has 64KB that can be shared between shared memory and L1 cache.

The default shared memory size is 48KB. To get to 64KB, you need to call cudaFuncSetAttribute to increase the max amount of dynamic shared memory, but the exact max varies depending on the device, so it may not give you 64KB everywhere.
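In CUDA Fortran the opt-in looks roughly like this (INTEGRAL stands in for your kernel; check the return code, since the device may not support the requested size):

```fortran
! Raise the per-kernel cap on dynamic shared memory to 64KB (device permitting)
ISTAT = cudaFuncSetAttribute(INTEGRAL, &
        cudaFuncAttributeMaxDynamicSharedMemorySize, 65536)
IF (ISTAT /= cudaSuccess) WRITE(*,*) 'Not supported: ', cudaGetErrorString(ISTAT)
```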

Of course your original example doesn't actually use dynamic shared memory, so this extra memory goes to waste. Again, if you want to use dynamic shared memory, the shared array in the kernel needs to be an automatic array or declared assumed-size.

I have 16 SMs and 48KB of shared memory for each. Assuming the SMs end up running the same number of kernels, (5600/16 =) 350: is it going to try to allocate (350*0.25KB =) 87.5KB of shared memory instead of writing over the same 0.25KB when I launch it statically?

Each SM has its own shared memory, which is partitioned amongst the blocks actively running concurrently on that SM. This memory is re-partitioned and reused for subsequent blocks scheduled after the previous blocks complete.

Keep in mind that an SM can run up to 2048 concurrent threads, or a max of 32 blocks (at 64 threads per block). But if one block uses all the available shared memory, only that single block can run on the SM, which limits occupancy. Given your block has 256 threads, you're getting at best 12.5% theoretical occupancy, since only 1/8th of the total number of concurrent threads can actually run.

Not counting register usage, which also has an impact: to get to 100% occupancy, you can have a max of 6KB of shared memory per block with a block size of 256, or 24KB with a block size of 1024.
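The arithmetic behind those limits, as a host-side sketch (2048 threads/SM and 48KB are the values discussed above; the actual limits vary by compute capability):

```fortran
PROGRAM OCC_SKETCH
   INTEGER, PARAMETER :: THREADS_PER_SM = 2048
   INTEGER, PARAMETER :: SHMEM_PER_SM   = 48*1024   ! 48KB default
   ! 256-thread blocks: 2048/256 = 8 blocks fill the SM,
   ! so each may use at most 48KB/8 = 6KB of shared memory
   PRINT *, 'max shmem/block @256  (bytes):', SHMEM_PER_SM / (THREADS_PER_SM/256)
   ! 1024-thread blocks: 2048/1024 = 2 blocks, so up to 48KB/2 = 24KB each
   PRINT *, 'max shmem/block @1024 (bytes):', SHMEM_PER_SM / (THREADS_PER_SM/1024)
   ! One 256-thread block using all 48KB: 256/2048 = 12.5% occupancy
   PRINT *, 'theoretical occupancy (%)    :', 100.0*256.0/THREADS_PER_SM
END PROGRAM OCC_SKETCH
```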

Note that the CUDA occupancy calculator is now part of Nsight Compute and might help in understanding how shared memory and register use affect theoretical occupancy.


Thank you for the great explanation!


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.