I am working on a CUDA kernel that requires the use of both 1D and 2D arrays in shared memory. I need to declare and use these arrays within the same kernel, but I am encountering some limitations with the amount of shared memory I can allocate, despite the theoretical limits suggested by my device capabilities.
Here’s the situation: my CUDA kernel needs both 1D and 2D arrays in shared memory, and based on what my device reports (via nvaccel), I should be good to go with up to 49152 bytes of shared memory per block (exactly 6144 doubles, since each double is 8 bytes).
But here’s the snag:
I’ve set up a 32x32 2D array alongside a 1D array of 32 elements, all doubles.
That adds up to (32*32 + 32) * 8 = 8448 bytes, yet I can’t seem to allocate any more, even though this is far below my device’s stated maximum of 49152 bytes.
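For reference, here is a minimal sketch of how I declare the two arrays (simplified from my actual kernel; the kernel name and the work it does are just placeholders):

__global__ void myKernel(const double *in, double *out)  // placeholder name
{
    // Static shared memory: (32*32 + 32) * sizeof(double) = 8448 bytes total
    __shared__ double tile[32][32];  // 2D array: 32*32*8 = 8192 bytes
    __shared__ double pivot[32];     // 1D array:    32*8 =  256 bytes

    int row = threadIdx.y;
    int col = threadIdx.x;
    tile[row][col] = in[row * 32 + col];
    if (row == 0)
        pivot[col] = tile[col][col];
    __syncthreads();

    out[row * 32 + col] = tile[row][col] + pivot[col];
}
// Launched with a 32x32 thread block:
// myKernel<<<1, dim3(32, 32)>>>(d_in, d_out);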
Does anyone have any idea why I can’t use more of my device’s shared memory capacity? Am I missing something in how shared memory gets allocated? Could other factors, such as the kernel launch configuration or register usage, be limiting my shared memory?
I’d really appreciate any thoughts or pointers on what might be going wrong or how I can push this limit. Thanks in advance for taking the time to help out!
There are no other factors that prevent shared memory usage up to 48K per block. If you want to go beyond 48K, there are other considerations.
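If you do want more than 48K, you need dynamic shared memory plus an explicit opt-in via cudaFuncSetAttribute. A minimal sketch (assuming a Volta-or-newer device; the kernel itself is just a placeholder):

#include <cuda_runtime.h>

__global__ void bigSmemKernel(double *out)
{
    extern __shared__ double smem[];  // sized at launch time
    smem[threadIdx.x] = (double)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = smem[threadIdx.x];
}

int main()
{
    size_t bytes = 64 * 1024;  // 64 KB: above the default 48 KB limit

    // Without this opt-in, any launch requesting more than 48 KB of
    // dynamic shared memory fails with an invalid-argument error.
    cudaFuncSetAttribute(bigSmemKernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)bytes);

    double *d_out;
    cudaMalloc(&d_out, 256 * sizeof(double));
    bigSmemKernel<<<1, 256, bytes>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}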
My guess would be that some other aspect of your code breaks when you increase the shared memory usage, and that the breakage has nothing to do with shared memory itself. But it’s impossible to say without an actual complete example of what you are doing.
The last GPU generation where 48 KiB was a hard limit was Pascal, i.e. the Nvidia GT(X) 10x0 cards (plus the Xavier embedded/SoC boards, which are Volta generation).
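If you are unsure which case your GPU falls into, you can query both limits at runtime; a small sketch:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int dev = 0;
    int defaultMax = 0, optinMax = 0;

    // Default per-block limit (48 KB on most devices)
    cudaDeviceGetAttribute(&defaultMax,
                           cudaDevAttrMaxSharedMemoryPerBlock, dev);
    // Upper bound reachable via the opt-in (same as the default on
    // Pascal and older devices)
    cudaDeviceGetAttribute(&optinMax,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, dev);

    printf("default max shared per block: %d bytes\n", defaultMax);
    printf("opt-in  max shared per block: %d bytes\n", optinMax);
    return 0;
}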
Is your 1D array the externally defined array sharedPivot? Is its size sharedMemSize, i.e. (32*32+32) elements instead of 32? Can you confirm that ROW_PER_BLOCK and COL_PER_BLOCK are both 32?
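If sharedPivot is indeed declared extern, one common pattern is to carve both arrays out of a single extern __shared__ buffer. A sketch, assuming ROW_PER_BLOCK == COL_PER_BLOCK == 32 (the kernel name and the arithmetic are placeholders; only sharedPivot and sharedMemSize come from your description):

#define ROW_PER_BLOCK 32
#define COL_PER_BLOCK 32

__global__ void pivotKernel(double *out)  // placeholder name
{
    // One dynamic buffer, partitioned manually: the 2D tile first,
    // then the 1D pivot array directly behind it.
    extern __shared__ double smem[];
    double *tile = smem;                                         // 32*32 doubles
    double *sharedPivot = smem + ROW_PER_BLOCK * COL_PER_BLOCK;  // 32 doubles

    int row = threadIdx.y, col = threadIdx.x;
    tile[row * COL_PER_BLOCK + col] = (double)(row + col);
    if (row == 0)
        sharedPivot[col] = tile[col * COL_PER_BLOCK + col];
    __syncthreads();

    out[row * COL_PER_BLOCK + col] =
        tile[row * COL_PER_BLOCK + col] - sharedPivot[col];
}

// At launch, sharedMemSize must cover both pieces, in bytes:
// size_t sharedMemSize = (ROW_PER_BLOCK * COL_PER_BLOCK + COL_PER_BLOCK) * sizeof(double);
// pivotKernel<<<grid, dim3(COL_PER_BLOCK, ROW_PER_BLOCK), sharedMemSize>>>(d_out);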
You cannot have more than 1024 threads per block, but you can allocate (a bit) more shared memory.
So your (at most 1024) threads have to do the work for a 64x64 data array. Introduce some for loops: the (limited!) block size is not automatically identical to your data size.
For example (with threadIdx.x == 0..31; blockDim.x == 32):
for (int i = threadIdx.x; i < 64; i += blockDim.x)
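Put together, a sketch of the idea for a 64x64 tile handled by a 32x32 thread block (the kernel name and the doubling are just for illustration):

__global__ void tileKernel(const double *in, double *out)
{
    __shared__ double tile[64][64];  // 64*64*8 = 32768 bytes, well below 48K

    // 1024 threads cover 4096 elements: each thread loops over the
    // rows and columns the block is too small to reach in one pass.
    for (int i = threadIdx.y; i < 64; i += blockDim.y)
        for (int j = threadIdx.x; j < 64; j += blockDim.x)
            tile[i][j] = in[i * 64 + j];
    __syncthreads();

    for (int i = threadIdx.y; i < 64; i += blockDim.y)
        for (int j = threadIdx.x; j < 64; j += blockDim.x)
            out[i * 64 + j] = 2.0 * tile[i][j];
}
// Launch: tileKernel<<<1, dim3(32, 32)>>>(d_in, d_out);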