Need Help with Shared Memory Allocation for 1D and 2D Arrays in CUDA

I am working on a CUDA kernel that requires the use of both 1D and 2D arrays in shared memory. I need to declare and use these arrays within the same kernel, but I am encountering some limitations with the amount of shared memory I can allocate, despite the theoretical limits suggested by my device capabilities.

Here’s the situation: my CUDA kernel needs both 1D and 2D arrays in shared memory, and based on what my device can handle (according to nvaccelinfo), I should be good to go with up to 49152 bytes per block (6144 doubles, at 8 bytes each).
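For reference, that per-block limit can also be queried programmatically; here is a minimal sketch using the standard cudaGetDeviceProperties call (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}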

But, here’s the snag:

  • I’ve set up a 2D array of 32x32 alongside a 1D array of size 32.
  • This adds up to (32*32 + 32) * 8 = 8448 bytes, yet I’m stuck at this setup and can’t seem to allocate more, even though it’s far below my device’s 49152-byte maximum.

Does anyone have any idea why I can’t use more of my device’s shared memory capacity? Am I missing something in how shared memory gets allocated? Could other factors like kernel configurations or register usage be limiting my shared memory usage?

I’d really appreciate any thoughts or pointers on what might be going wrong or how I can push this limit. Thanks in advance for taking the time to help out!

You’ve given no indication of why you are “stuck”, or of how you determined that you can’t allocate more.

A short, complete example would probably help others help you quickly.

Here is a trivial example of allocating a 32x32 2D array and a 1D array of size 32, plus an extra 1024-element array to show there is plenty of room left (about 16.5 KB in total, well under the 48K limit):

__global__ void k() {
  __shared__ double my_2D_array[32][32];   // 8192 bytes
  __shared__ double my_1D_array[32];       // 256 bytes
  __shared__ double my_extra_array[1024];  // 8192 bytes
  // ...
}

There aren’t other factors that prevent shared memory usage up to 48K per block. If you go beyond 48K, there are other considerations.
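For completeness: beyond 48K the kernel must explicitly opt in, and only the dynamically allocated portion can grow. A minimal sketch with a hypothetical kernel, assuming a Volta-or-newer GPU (so not applicable to Pascal cards):

#include <cuda_runtime.h>

// Hypothetical kernel using dynamic shared memory, sized at launch time.
__global__ void big_smem_kernel() {
    extern __shared__ double buf[];
    buf[threadIdx.x] = (double)threadIdx.x;  // touch the buffer
}

int main() {
    int smem_bytes = 64 * 1024;  // 64 KiB, beyond the default 48 KiB
    // Opt in: raise this kernel's dynamic shared memory ceiling.
    cudaFuncSetAttribute(big_smem_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         smem_bytes);
    big_smem_kernel<<<1, 256, smem_bytes>>>();
    return (cudaDeviceSynchronize() == cudaSuccess) ? 0 : 1;
}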

My guess would be that some other aspect of your code breaks when you increase the shared memory usage, and that the root cause has nothing to do with shared memory itself. But it’s impossible to say without an actual, complete example of what you are doing.

Also which GPU are you using?

The last GPU generation where 48 KiB was the hard per-block limit was Pascal, i.e. the NVIDIA GT(X) 10x0 cards (plus the Xavier embedded/SoC boards, which are of the Volta generation).

NVIDIA GeForce MX 350

That is indeed the Pascal generation, with a maximum of 48 KiB of shared memory per block.

Here is the specific part of my code where I am facing the error:

__global__ void updateMatrix(double* matrix, double* result, int lpivot, int m, int n)
{
    extern __shared__ double sharedPivot[];  // dynamic: sized by the launch's third argument
    __shared__ double sharedResults[ROW_PER_BLOCK][COL_PER_BLOCK];
    // ...
}

On the host side:
dim3 blockDim(COL_PER_BLOCK, ROW_PER_BLOCK);  // 32x32 threads per block
dim3 gridDim(1, (m + ROW_PER_BLOCK - 1) / ROW_PER_BLOCK);

int sharedMemSize = sizeof(double) * (COL_PER_BLOCK * ROW_PER_BLOCK + COL_PER_BLOCK);

When I run this, I get the error:
Invalid configuration argument

What is the value of m?
Can you also show the kernel launch line?

Is your 1D array the externally defined array sharedPivot? Is it allocated with the full sharedMemSize, i.e. (32*32 + 32) elements instead of 32? Can you confirm that ROW_PER_BLOCK and COL_PER_BLOCK are both 32?
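If the intent is for both arrays to live in the dynamic allocation, the usual pattern is to carve them out of a single extern buffer by hand. A minimal sketch (the kernel name updateMatrix_sketch is hypothetical; sizes taken from your sharedMemSize computation):

__global__ void updateMatrix_sketch(/* params omitted */)
{
    extern __shared__ double smem[];         // sharedMemSize bytes in total
    double* sharedResults = smem;            // first 32*32 doubles
    double* sharedPivot   = smem + 32 * 32;  // next 32 doubles
    // index the 2D part as sharedResults[row * 32 + col]
}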

dim3 blockDim(COL_PER_BLOCK, ROW_PER_BLOCK);  // 32x32 threads per block
dim3 gridDim(1, (m + ROW_PER_BLOCK - 1) / ROW_PER_BLOCK);

int sharedMemSize = sizeof(double) * (COL_PER_BLOCK * ROW_PER_BLOCK + COL_PER_BLOCK);

updateMatrix<<<gridDim, blockDim, sharedMemSize>>>(matrix, d_result, lpivot, m, n);

m is the number of rows.

Yes, both ROW_PER_BLOCK and COL_PER_BLOCK are 32.

I was interested in the numerical value of m. Here’s what I would like to see. Place the following code immediately before the kernel call:

cudaError_t my_err_zz = cudaGetLastError();
std::cout << "***" << cudaGetErrorString(my_err_zz) << std::endl;
std::cout << "***sharedMemSize = " << sharedMemSize << std::endl;
std::cout << "***gridDim: " << gridDim.x << " " << gridDim.y << " " << gridDim.z << std::endl;
std::cout << "***blockDim: " << blockDim.x << " " << blockDim.y << " " << blockDim.z << std::endl;

Place the following code immediately after the kernel call:

my_err_zz = cudaGetLastError();
std::cout << "***" << cudaGetErrorString(my_err_zz) << std::endl;

If necessary, add:

 #include <iostream>

at the top of the file.

Then recompile the code, run it, and paste here the output that is preceded by 3 stars on each line.
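As an aside, the same checking can be wrapped in a small macro so every runtime call is verified; this is a common convention (the name CUDA_CHECK is just a placeholder), nothing specific to this code:

#include <cstdio>
#include <cuda_runtime.h>

// Common pattern: check the result of every CUDA runtime call.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t e_ = (call);                                  \
        if (e_ != cudaSuccess)                                    \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",       \
                    cudaGetErrorString(e_), __FILE__, __LINE__);  \
    } while (0)

// Usage around a kernel launch:
//   CUDA_CHECK(cudaGetLastError());      // errors from prior calls
//   kernel<<<grid, block, smem>>>(...);
//   CUDA_CHECK(cudaGetLastError());      // launch configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize()); // errors during execution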

I am working on a 1024 x 1024 matrix, and here is the output from running the code you gave me:

***no error
***sharedMemSize = 33280
***gridDim: 1 16 1
***blockDim: 64 64 1
***invalid configuration argument

According to the CUDA C++ Programming Guide, the maximum number of threads per block is 1024. A (64, 64, 1) block is 64*64 = 4096 threads, which exceeds this limit.
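A quick way to confirm this on any device is a pre-launch sanity check; a minimal sketch (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    dim3 block(64, 64, 1);                          // the failing configuration
    unsigned total = block.x * block.y * block.z;   // 64*64*1 = 4096 threads
    if (total > (unsigned)prop.maxThreadsPerBlock)  // 1024 on this GPU
        printf("%u threads per block exceeds the limit of %d\n",
               total, prop.maxThreadsPerBlock);
    return 0;
}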

Why are they both 64 now?

As I said, with 32x32 it works; anything more stops the code from working.

You cannot have more threads per block, but you can allocate (a bit) more shared memory.

So your (at most 1024) threads have to do the work for a 64x64 data array. Introduce some for loops; the (limited!) block size is not automatically identical to your data size.

For example (with threadIdx.x == 0..31 and blockDim.x == 32, so each thread covers two of the 64 indices):

for (int i = threadIdx.x; i < 64; i += blockDim.x) {
    // thread t handles elements t and t + 32
}
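A fuller sketch of the same idea in two dimensions, assuming a hypothetical 64x64 shared tile covered by a 32x32 block (the tile is 64*64*8 = 32 KiB, still under the 48 KiB limit; in, out, and the kernel name are placeholders):

#define TILE 64  // data tile is larger than the 32x32 thread block

__global__ void strided_tile_kernel(const double* in, double* out)
{
    __shared__ double tile[TILE][TILE];  // 32 KiB of shared memory

    // Each thread strides over the tile, covering a 2x2 set of elements.
    for (int r = threadIdx.y; r < TILE; r += blockDim.y)
        for (int c = threadIdx.x; c < TILE; c += blockDim.x)
            tile[r][c] = in[r * TILE + c];

    __syncthreads();  // whole tile loaded before anyone reads it

    for (int r = threadIdx.y; r < TILE; r += blockDim.y)
        for (int c = threadIdx.x; c < TILE; c += blockDim.x)
            out[r * TILE + c] = 2.0 * tile[r][c];  // placeholder work
}

Launched with dim3 block(32, 32), the full 64x64 tile is covered by only 1024 threads.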