Assistance Needed with GPU Memory Allocation Error When Packing and Unpacking Arrays

Hello NVIDIA Forum,

I would like to ask about a technical issue I’m encountering. I am packing four arrays, each with dimensions (31, 64, 5), into a one-dimensional array and transferring it to the GPU. After that, I unpack the one-dimensional array back into four (31, 64, 5) arrays for computation. However, I am encountering the following error:

Additionally, here is a snippet of my code:

Could someone please help me understand what might be causing this error and how to resolve it?

Thank you!
First, on the CPU, I pack the four three-dimensional arrays into a one-dimensional array. Below is the kernel function call. Next, in the GPU kernel, I unpack the one-dimensional array back into the four three-dimensional arrays. Finally, Nsight Compute reports an error.
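For reference, the packing step described above can be modeled as plain index arithmetic. This is a hypothetical C sketch, not the original code: the names (`pack_index`, `pack4`) and the column-major layout (which mirrors how Fortran stores a (31,64,5) array, with the first index varying fastest) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of packing four (31,64,5) arrays into one 1D buffer.
 * Names and layout are assumptions; Fortran is column-major, so the first
 * index varies fastest within each array. */
#define NX 31
#define NY 64
#define NZ 5
#define NELEM (NX * NY * NZ)

/* Linear offset of element (ix, iy, iz) of the a-th packed array.
 * All indices are 0-based here; Fortran's would be 1-based. */
size_t pack_index(size_t a, size_t ix, size_t iy, size_t iz) {
    return a * NELEM + ix + NX * (iy + (size_t)NY * iz);
}

/* Copy the four contiguous arrays back-to-back into one 1D buffer. */
void pack4(const double *a0, const double *a1, const double *a2,
           const double *a3, double *packed) {
    memcpy(packed + 0 * NELEM, a0, NELEM * sizeof *packed);
    memcpy(packed + 1 * NELEM, a1, NELEM * sizeof *packed);
    memcpy(packed + 2 * NELEM, a2, NELEM * sizeof *packed);
    memcpy(packed + 3 * NELEM, a3, NELEM * sizeof *packed);
}
```

Unpacking on the device is the same arithmetic in reverse: element (a, ix, iy, iz) lives at `pack_index(a, ix, iy, iz)` in the 1D buffer.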




It’s an out-of-memory error. In the kernel you have four “AB” arrays that are defined locally. In other words, every thread will need its own copy of the complete array, thus consuming a lot of memory.

Since they are private to each thread, the arrays only need to be one-dimensional, 5-element arrays. The first two dimensions are extraneous.

Do you mean that in this kernel function, each thread has already been allocated its own copy of these large arrays? So what I need to do is replace the 3D “AB” arrays with 1D arrays?

Think of the kernel as serial code that one thread is executing. There will be multiple threads each executing the same kernel code, but each will have its own local data. The “AB” arrays are declared locally, so each thread will have its own private copy of each array (hence the out-of-memory error).

However, the IX and IY indices are fixed for each thread. So while each thread would have the full array, it only accesses ABDENS(IX,IY,1:5); the rest of the elements aren’t used. Hence you can remove the first two dimensions and reduce the array to ABDENS(1:5).
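That suggestion can be sketched as a host-side C model of one thread’s work (the names `unpack_dens` and `packed`, and the column-major layout, are assumptions, not the original code; in the actual kernel this would be a private ABDENS(1:5) array):

```c
#include <assert.h>

/* Hypothetical model of one thread's work after the suggested change:
 * since a thread with fixed ix, iy only ever touches ABDENS(ix,iy,1:5),
 * a private 5-element array is enough. Indices are 0-based here. */
enum { KMAX = 5 };

void unpack_dens(const double *packed, int ix, int iy, int nx, int ny,
                 double abdens[KMAX]) {
    for (int k = 0; k < KMAX; ++k)
        abdens[k] = packed[ix + nx * (iy + ny * k)]; /* column-major (ix,iy,k) */
}
```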

Now, looking at the second part of the code, I’m not sure this will work for the rest of the algorithm due to the “XI” index. It seems you’re presuming the “AB” arrays are global. So perhaps the correct solution is to make these global arrays and pass them into the kernel instead of declaring them locally.

However, you then have a synchronization issue in that they need to be completely filled before assignment into the “AA” arrays. You could put a syncthreads between the sections, but this only syncs the threads in the same block, so you’ll also need to restrict the launch configuration to a single block. Given the small size of the arrays and that the kernel is only copying data, this should be fine. (Note that you may need to add a loop in the kernel in the event you have fewer threads than array elements, so each thread can process multiple elements.)
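The stride loop mentioned in that note can be sketched as a host-side C model, where `tid` stands in for the thread index and `nthreads` for the block size (all names are illustrative):

```c
#include <assert.h>

/* Host-side model of a stride loop: when there are fewer threads than
 * elements, each thread handles every nthreads-th element, so together
 * the threads cover every element exactly once. */
void copy_with_stride(const double *src, double *dst, int n,
                      int tid, int nthreads) {
    for (int i = tid; i < n; i += nthreads)
        dst[i] = src[i];
}
```

Simulating a 3-thread block over 10 elements, tid 0 handles indices 0, 3, 6, 9; tid 1 handles 1, 4, 7; tid 2 handles 2, 5, 8.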

Though if you do want multiple blocks, then you need a global sync. You can investigate using cooperative groups to do the global sync, but this will add overhead and hurt performance. Instead, I’d suggest you split this into two kernels: the first to perform the gather into the “AB” arrays, and the second to do the assignment into “AA”. So long as the two kernels are on the same stream (or there’s a cudaDeviceSynchronize between them), you can ensure AB will be filled before the assignment.
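A minimal host-side sketch of that two-phase structure, with plain C functions standing in for the two kernels (all names are illustrative): calling them in order models launching them on the same CUDA stream, which guarantees the gather finishes before the assignment starts.

```c
#include <assert.h>

/* Phase 1: fill AB completely (stands in for the first kernel launch). */
void gather_kernel(const double *packed, double *ab, int n) {
    for (int i = 0; i < n; ++i)
        ab[i] = packed[i];
}

/* Phase 2: assign into AA; AB is guaranteed fully populated by the time
 * this runs (stands in for the second kernel launch on the same stream). */
void assign_kernel(const double *ab, double *aa, int n) {
    for (int i = 0; i < n; ++i)
        aa[i] = ab[i];
}
```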

Thank you so much for helping me solve another issue! The range of XI in the AB arrays is from 1 to 31, and the range of IX in the AA arrays is from 1 to 64. I want to use one XI element from the AB array for every two IX elements in the AA array. If I do this, will the access conflicts increase? Would it be better if I first assign the values of XI to the odd-indexed IX elements, and then to the even-indexed IX elements?

I wouldn’t think so. As I read the code, you have two threads reading the same value from the AB arrays. The value will likely be in cache, so it should be fine. If you do the two-pass method, it may mean the value gets evicted from the cache and needs to be reread.

I’m just guessing, so you’d need to experiment to see. Though given these are very small kernels that do no computation, only assignment, I personally wouldn’t spend too much time trying to optimize them. I doubt they are very impactful to the overall performance of the application. Get it working and then move on to more performance-relevant sections of the code.
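The one-XI-per-two-IX mapping discussed above can be sketched as follows, using 1-based indices as in Fortran (the function name is an assumption; note this rule gives XI = 32 for IX = 63, 64, while the AB arrays only run to 31, so the actual code must handle that last pair differently):

```c
#include <assert.h>

/* Hypothetical one-XI-per-two-IX mapping with 1-based indices:
 * IX = 1,2 -> XI = 1; IX = 3,4 -> XI = 2; and so on, so each
 * consecutive pair of IX values reads the same XI element. */
int xi_for_ix(int ix) {
    return (ix - 1) / 2 + 1;
}
```

Each consecutive pair of IX values reads the same XI element, which is the shared read the reply above expects to be served from cache.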