Need Help with Shared Memory Allocation for 1D and 2D Arrays in CUDA

jeeadv2021failure · May 14, 2024, 4:48pm

I am working on a CUDA kernel that requires the use of both 1D and 2D arrays in shared memory. I need to declare and use these arrays within the same kernel, but I am encountering some limitations with the amount of shared memory I can allocate, despite the theoretical limits suggested by my device capabilities.

Here’s the situation: My CUDA kernel needs both 1D and 2D arrays in shared memory, and based on what my device can handle (according to nvaccel), I should be good to go with up to 49152 bytes per block (which should be about 6144 doubles, given each double is 8 bytes).

But, here’s the snag:

I’ve set up a 2D array of 32x32 alongside a 1D array of size 32.
This adds up to (32*32 + 32) * 8 = 8448 bytes, but I’m stuck with this setup and can’t seem to allocate more even though it’s way below my device’s max capacity of 49152 bytes.

Does anyone have any idea why I can’t use more of my device’s shared memory capacity? Am I missing something in how shared memory gets allocated? Could other factors like kernel configurations or register usage be limiting my shared memory usage?

I’d really appreciate any thoughts or pointers on what might be going wrong or how I can push this limit. Thanks in advance for taking the time to help out!

Robert_Crovella · May 14, 2024, 4:56pm

You’ve given no indication why you are “stuck”, and in what way you have decided you can’t allocate more.

probably a short, complete code would help others to help you quickly.

Here is a trivial example of allocating a 32x32 array and a 1D array of size 32:

__global__ void k(){

  __shared__ double my_2D_array[32][32];  // 8k bytes
  __shared__ double my_1D_array[32]; // 256 bytes
  __shared__ double my_extra_array[1024];  // 8k bytes
  ...
}

There aren’t other factors that prevent shared usage up to 48K per block. If you go beyond 48K, there are other considerations.

My guess would be some other aspect of your code is breaking when you increase the shared usage, and it has nothing to do with shared usage. But its impossible to say without an actual complete example of what you are doing.

Curefab · May 14, 2024, 5:42pm

Also which GPU are you using?

The last GPU, where 48 KiB was the final limit was the Pascal generation = Nvidia GT(X) 10x0 (and the Xavier embedded/SoC boards from Turing generation).

jeeadv2021failure · May 14, 2024, 7:04pm

NVIDIA GeForce MX 350

Curefab · May 14, 2024, 7:07pm

Which really is Pascal generation with a maximum of 48 KiB per block.

jeeadv2021failure · May 14, 2024, 7:18pm

here is my specific part of code where i am facing the error

__global__ void updateMatrix(double* matrix,double* result,int lpivot ,int m, int n) 
{
     extern __shared__ double sharedPivot[];
    __shared__ double sharedResults[ROW_PER_BLOCK][COL_PER_BLOCK];

in the host side 
dim3 blockDim(COL_PER_BLOCK, ROW_PER_BLOCK);  // 32x32 threads per BLOCK
    dim3 gridDim(1, (m +ROW_PER_BLOCK - 1) / ROW_PER_BLOCK);    

    int sharedMemSize =  sizeof(double) * ( COL_PER_BLOCK * ROW_PER_BLOCK + COL_PER_BLOCK);

when i run this , i get the error as
Invalid configuration argument

Robert_Crovella · May 14, 2024, 7:38pm

what is the value of m?
can you also show the kernel launch line?

Curefab · May 14, 2024, 7:58pm

Is your 1D array the externally defined array sharedPivot?. Does it have sharedMemSize, which is (32*32+32) elements instead of 32? Can you confirm the size of ROW_PER_BLOCK and COL_PER_BLOCK to both be 32?

jeeadv2021failure · May 15, 2024, 9:19am

 dim3 blockDim(COL_PER_BLOCK, ROW_PER_BLOCK);  // 32x32 threads per BLOCK
 dim3 gridDim(1, (m +ROW_PER_BLOCK - 1) / ROW_PER_BLOCK);    

 int sharedMemSize =  sizeof(double) * ( COL_PER_BLOCK * ROW_PER_BLOCK + COL_PER_BLOCK); 
 
 updateMatrix<<<gridDim,blockDim, sharedMemSize>>>(matrix,d_result,lpivot,m, n);

m is the number of rows

jeeadv2021failure · May 15, 2024, 9:20am

yes both ROW_PER_BLOCK and COL_PER_BLOCK are 32

Robert_Crovella · May 15, 2024, 2:02pm

I was interested in the numerical value of m. Here’s what I would like to see. Place the following code immediately before the kernel call:

cudaError_t my_err_zz = cudaGetLastError();
std::cout << "***" << cudaGetErrorString(err) << std::endl;
std::cout << "***sharedMemSize = " << sharedMemSize << std::endl;
std::cout << "***gridDim: " << gridDim.x << " " << gridDim.y << " " << gridDIm.z << std::endl;
std::cout << "***blockDim: " << blockDim.x << " " << blockDim.y << " " << blockDim.z << std::endl;

Place the following code immediately after the kernel call:

my_err_zz = cudaGetLastError();
std::cout << "***" << cudaGetErrorString(err) << std::endl;

If necessary, add:

 #include <iostream>

at the top of the file.

Then recompile the code, run it, and paste here the output that is preceded by 3 stars on each line.

jeeadv2021failure · May 16, 2024, 8:45am

I am working on a 1024 x 1024 size matrix and here is the output from compiling the code you have give me

***no error
***sharedMemSize = 33280
***gridDim: 1 16 1
***blockDim: 64 64 1
***invalid configuration argument

striker159 · May 16, 2024, 9:02am

According to the documentation here CUDA C++ Programming Guide , the maximum number of threads per block is 1024. (64,64,1) exceeds this limit

Curefab · May 16, 2024, 12:12pm

Why are they both 64 now?

jeeadv2021failure · May 16, 2024, 12:40pm

as I said for 32 *32 it works , any more will stop the code to work

Curefab · May 16, 2024, 12:47pm

You cannot have more threads per block. But you can allocate (a bit) more of shared memory.

So your fewer (1024) threads have to do the work for a 64x64 data array. Introduce some for loops. The (limited!) block size is not automatically identical to your data size.

For example (with threadIdx.x == 0..31; blockDim.x == 32):

for (int i = threadIdx.x; i < 64; i += blockDim.x)

Topic		Replies	Views
Size limitation for 1D Arrays in CUDA? CUDA Programming and Performance	9	18137	October 17, 2013
How can I configure this problem is it too big to fit in shared memory? CUDA Programming and Performance	7	3717	October 14, 2008
Newbie help on thread blocks CUDA Programming and Performance	22	10595	December 24, 2008
threads and blocks CUDA Programming and Performance	3	1342	May 7, 2012
CUDA and Quadro FX 580 memory usage/corruption CUDA Programming and Performance	0	709	July 27, 2014
Kernel configuration and maximum array size problem. CUDA Programming and Performance	8	6473	January 25, 2009
shared memory and CUDA calculator CUDA Programming and Performance	6	4033	October 26, 2008
Launching Kernel Fail CUDA Programming and Performance	15	3399	May 28, 2014
Question about device query information, 48 cuda cores * 32, versus 1024 limitation. Maximum perform CUDA Programming and Performance	3	11490	June 16, 2011
How many can use Blocks to effcient parallel prog CUDA Programming and Performance	8	5789	December 12, 2009

Need Help with Shared Memory Allocation for 1D and 2D Arrays in CUDA

Related topics