Assistance Needed with GPU Memory Allocation Error When Packing and Unpacking Arrays

Hello NVIDIA Forum,

I would like to ask about a technical issue I’m encountering. I am packing four arrays, each with dimensions (31, 64, 5), into a one-dimensional array and transferring it to the GPU. After that, I unpack the one-dimensional array back into four (31, 64, 5) arrays for computation. However, I am encountering the following error:

Additionally, here is a snippet of my code:

Could someone please help me understand what might be causing this error and how to resolve it?

Thank you!
First, on the CPU, I pack the four three-dimensional arrays into a one-dimensional array. Below is the kernel function call. Next, in the GPU kernel, I unpack the one-dimensional array back into the four three-dimensional arrays. Finally, Nsight Compute reports an error.
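For reference, the packing step described above can be modeled as plain index arithmetic. This is a hypothetical C sketch, not the original code: the names (`pack_index`, `pack4`) and the column-major layout (which mirrors how Fortran stores a (31,64,5) array, with the first index varying fastest) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of packing four (31,64,5) arrays into one 1D buffer.
 * Names and layout are assumptions; Fortran is column-major, so the first
 * index varies fastest within each array. */
#define NX 31
#define NY 64
#define NZ 5
#define NELEM (NX * NY * NZ)

/* Linear offset of element (ix, iy, iz) of the a-th packed array.
 * All indices are 0-based here; Fortran's would be 1-based. */
size_t pack_index(size_t a, size_t ix, size_t iy, size_t iz) {
    return a * NELEM + ix + NX * (iy + (size_t)NY * iz);
}

/* Copy the four contiguous arrays back-to-back into one 1D buffer. */
void pack4(const double *a0, const double *a1, const double *a2,
           const double *a3, double *packed) {
    memcpy(packed + 0 * NELEM, a0, NELEM * sizeof *packed);
    memcpy(packed + 1 * NELEM, a1, NELEM * sizeof *packed);
    memcpy(packed + 2 * NELEM, a2, NELEM * sizeof *packed);
    memcpy(packed + 3 * NELEM, a3, NELEM * sizeof *packed);
}
```

Unpacking on the device is the same arithmetic in reverse: element (a, ix, iy, iz) lives at `pack_index(a, ix, iy, iz)` in the 1D buffer.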




It’s an out-of-memory error. In the kernel you have four “AB” arrays that are defined locally. In other words, every thread will need its own copy of the complete array, thus consuming a lot of memory.

Since they are private to each thread, the arrays only need to be one-dimensional, 5-element arrays. The first two dimensions are extraneous.

Do you mean that in this kernel function, each thread has already been allocated its own copy of these large arrays? So what I need to do is replace the 3D “AB” arrays with 1D arrays?

Think of the kernel as serial code that one thread is executing. There will be multiple threads each executing the same kernel code, but each will have its own local data. The “AB” arrays are declared locally, so each thread will have its own private copy of each array (hence the out-of-memory error).

However, the IX and IY indices are fixed for each thread. So while each thread would have the full array, it only accesses ABDENS(IX,IY,1:5); the rest of the elements aren’t used. Hence you can remove the first two dimensions and reduce the array to ABDENS(1:5).
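That suggestion can be sketched as a host-side C model of one thread’s work (the names `unpack_dens` and `packed`, and the column-major layout, are assumptions, not the original code; in the actual kernel this would be a private ABDENS(1:5) array):

```c
#include <assert.h>

/* Hypothetical model of one thread's work after the suggested change:
 * since a thread with fixed ix, iy only ever touches ABDENS(ix,iy,1:5),
 * a private 5-element array is enough. Indices are 0-based here. */
enum { KMAX = 5 };

void unpack_dens(const double *packed, int ix, int iy, int nx, int ny,
                 double abdens[KMAX]) {
    for (int k = 0; k < KMAX; ++k)
        abdens[k] = packed[ix + nx * (iy + ny * k)]; /* column-major (ix,iy,k) */
}
```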

Now, looking at the second part of the code, I’m not sure this will work for the rest of the algorithm due to the “XI” index. It seems you’re presuming the “AB” arrays are global. So perhaps the correct solution is to make these global arrays and pass them into the kernel instead of declaring them locally.

However, you then have a synchronization issue in that they need to be completely filled before assignment into the “AA” arrays. You could put a syncthreads between the sections, but this only syncs the threads in the same block, so you’ll also need to restrict the launch configuration to a single block. Given the small size of the arrays and that the kernel is only copying data, this should be fine. (Note that you may need to add a loop in the kernel in the event you have fewer threads than array elements, so each thread can process multiple elements.)
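The stride loop mentioned in that note can be sketched as a host-side C model, where `tid` stands in for the thread index and `nthreads` for the block size (all names are illustrative):

```c
#include <assert.h>

/* Host-side model of a stride loop: when there are fewer threads than
 * elements, each thread handles every nthreads-th element, so together
 * the threads cover every element exactly once. */
void copy_with_stride(const double *src, double *dst, int n,
                      int tid, int nthreads) {
    for (int i = tid; i < n; i += nthreads)
        dst[i] = src[i];
}
```

Simulating a 3-thread block over 10 elements, tid 0 handles indices 0, 3, 6, 9; tid 1 handles 1, 4, 7; tid 2 handles 2, 5, 8.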

Though if you do want multiple blocks, then you need a global sync. You can investigate using cooperative groups to do the global sync, but this will add overhead and hurt performance. Instead, I’d suggest you split this into two kernels: the first to perform the gather into the “AB” arrays, and the second to do the assignment into “AA”. So long as the two kernels are on the same stream (or there’s a cudaDeviceSynchronize between them), you can ensure AB will be filled before the assignment.
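A minimal host-side sketch of that two-phase structure, with plain C functions standing in for the two kernels (all names are illustrative): calling them in order models launching them on the same CUDA stream, which guarantees the gather finishes before the assignment starts.

```c
#include <assert.h>

/* Phase 1: fill AB completely (stands in for the first kernel launch). */
void gather_kernel(const double *packed, double *ab, int n) {
    for (int i = 0; i < n; ++i)
        ab[i] = packed[i];
}

/* Phase 2: assign into AA; AB is guaranteed fully populated by the time
 * this runs (stands in for the second kernel launch on the same stream). */
void assign_kernel(const double *ab, double *aa, int n) {
    for (int i = 0; i < n; ++i)
        aa[i] = ab[i];
}
```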

Thank you so much for helping me solve another issue! The range of XI in the AB arrays is from 1 to 31, and the range of IX in the AA arrays is from 1 to 64. I want to use one XI element from the AB array for every two IX elements in the AA array. If I do this, will the access conflicts increase? Would it be better if I first assign the values of XI to the odd-indexed IX elements, and then to the even-indexed IX elements?

I wouldn’t think so. As I read the code, you have two threads reading the same value from the AB arrays. The value will likely be in cache, so it should be fine. If you do the two-pass method, it may mean the value gets evicted from the cache and needs to be reread.

I’m just guessing, so you’d need to experiment to see. Though given these are very small kernels that do no computation, only assignment, I personally wouldn’t spend too much time trying to optimize them. I doubt they are very impactful to the overall performance of the application. Get it working and then move on to more performance-relevant sections of the code.
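The one-XI-per-two-IX mapping discussed above can be sketched as follows, using 1-based indices as in Fortran (the function name is an assumption; note this rule gives XI = 32 for IX = 63, 64, while the AB arrays only run to 31, so the actual code must handle that last pair differently):

```c
#include <assert.h>

/* Hypothetical one-XI-per-two-IX mapping with 1-based indices:
 * IX = 1,2 -> XI = 1; IX = 3,4 -> XI = 2; and so on, so each
 * consecutive pair of IX values reads the same XI element. */
int xi_for_ix(int ix) {
    return (ix - 1) / 2 + 1;
}
```

Each consecutive pair of IX values reads the same XI element, which is the shared read the reply above expects to be served from cache.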