Hi all,
in this code example, a local array x2 is declared in each iteration and dummy data is copied into it:
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[]) {
    int nDim = 3;
    int nIter = 1000000;
    double x1[3];
    x1[0] = 1.0; x1[1] = 2.0; x1[2] = 3.0;

    #pragma omp target
    for (int i = 0; i < nIter; i++) {
        double x2[nDim]; // Segfault for large iteration counts (stack limit exceeded?)
        for (int d = 0; d < nDim; ++d) {
            x2[d] = x1[d];
        }
    }
}
It is compiled with nvc (version 23.3), at first without OpenMP enabled. To reproduce the error, set the stack limit to a fixed value, e.g. ulimit -s 8192. In this case, a segfault occurs when the binary is executed. The error can be circumvented either by raising the stack limit (or setting it to unlimited), or by moving the declaration of x2 out of the loop.
Compiling with OpenMP offloading support, using -mp=gpu -Minfo=all, gives another hint in the compilation diagnostics:
17, Accelerator restriction: unsupported statement type: opcode=DEALLOC
This message appears to refer to the declaration of x2, since it does not appear when the declaration is moved out of the loop.
I just found out that the issue is resolved when nDim is declared as const int, in which case a static allocation is made. Apparently the compiler otherwise assumes that nDim can change and makes a dynamic allocation. However, shouldn’t the memory be freed again in that case?
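For reference, here is a minimal sketch of the fixed-size variant. It is not the exact code from the post: in standard C a const int is formally still not a constant expression, so this sketch uses a macro (NDIM) to guarantee a fixed-size array. The helper name copy_fixed and the checksum it returns are hypothetical, added only so the loop has an observable result.

```c
#define NDIM 3  /* compile-time constant, so x2 below is not a VLA */

/* Hypothetical helper: repeats the copy loop from the post and returns a
   checksum so the compiler cannot discard the work. */
double copy_fixed(int nIter, const double *x1) {
    double sum = 0.0;
    for (int i = 0; i < nIter; i++) {
        double x2[NDIM];  /* fixed size: no per-iteration dynamic allocation */
        for (int d = 0; d < NDIM; ++d) {
            x2[d] = x1[d];
        }
        sum += x2[NDIM - 1];
    }
    return sum;
}
```

Calling copy_fixed(nIter, x1) with the x1 from the example above should run with a small stack limit, since the array size is known at compile time.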
Hi Christian,
Apologies for the late response. This one took a bit of digging, and since I was mentoring a GPU hackathon last week, I didn’t get a chance to investigate until today.
For the stack overflow issue on the host, we are putting the VLA on the stack, but it doesn’t appear that we’re using the correct stack save / stack restore, so each iteration’s allocation accumulates on the stack instead of being released. I’ve filed TPR #34033 and sent it to engineering for review.
For the unsupported DEALLOC statement, we do support free on the device, so it’s unclear what’s wrong. Hence I’ve filed TPR #34034 and will have engineering investigate.
However, I would highly recommend that you not use VLAs in device code. This will cause each thread to allocate and presumably deallocate the memory, which can be detrimental to performance as the allocations get serialized. Also, the default device heap is relatively small, which can lead to heap overflows. While the heap can be increased, it’s generally better to avoid device-side allocation.
The better approach is to hoist the declaration of the VLA to before the loop and then add the array to a “private” clause. This will also work around the DEALLOC issue.
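As a rough sketch of that approach (not tested against nvc 23.3): the VLA is declared once before the loop and listed in a private clause, so each thread gets its own copy. The function name run_private, the combined "target teams distribute parallel for" construct, and the checksum reduction are assumptions added for illustration; the original example used a bare "target" directive and had no output.

```c
/* Hypothetical helper: x2 is hoisted out of the loop and privatized, so the
   loop body no longer contains a VLA declaration. */
double run_private(int nDim, int nIter, const double *x1) {
    double x2[nDim];   /* declared once, before the loop */
    double sum = 0.0;  /* checksum, only here to make the result observable */

    #pragma omp target teams distribute parallel for \
        map(to: x1[0:nDim]) map(tofrom: sum) private(x2) reduction(+: sum)
    for (int i = 0; i < nIter; i++) {
        for (int d = 0; d < nDim; ++d) {
            x2[d] = x1[d];       /* each thread writes its own private x2 */
        }
        sum += x2[nDim - 1];
    }
    return sum;
}
```

Without OpenMP enabled the pragma is ignored and the loop runs serially on the host with a single x2, so the same source covers both builds.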
-Mat