Call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

There are several postings on this topic, but I could not find an exchange that addresses my problem. I have several 3-dimensional arrays in a C++ class that describe chemical concentrations and are updated on a time-step basis in an initial value problem. I would like to use the HPC SDK to run sections of the code on GPUs.

I’ve created a stripped-down version of the offending module:

void UpdateArrays(double*** A_T_array, double*** AX_array) {
#pragma acc kernels
  {
#pragma acc loop independent collapse(3)
    for (int jx = 0; jx < MaxPoints.x; jx++)
      for (int jy = 0; jy < MaxPoints.y; jy++)
        for (int jz = 0; jz < MaxPoints.z; jz++) {

          // diffusion of free ATM

          AX_array[jx][jy][jz] = A_T_array[jx][jy][jz];
        }
  }
}

Here is the relevant portion of the compile log:
EulerIntegrationMethod::UpdateArrays(double ***, double **):
502, Generating implicit copyin(this[:]) [if not already present]
505, Loop is parallelizable
506, Loop is parallelizable
507, Accelerator restriction: size of the GPU copy of A_T_array,AX_array is unknown
Loop is parallelizable
Generating Tesla code
505, #pragma acc loop gang, vector(128) collapse(3) /* blockIdx.x threadIdx.x */
506,   /* blockIdx.x threadIdx.x collapsed */
507,   /* blockIdx.x threadIdx.x collapsed */
507, Generating implicit copyout(AX_array[i1][i2][:]) [if not already present]
Generating implicit copyin(A_T_array[i1][i2][:]) [if not already present]
nvc++ -acc -Minfo=accel -Mlarge_arrays -O3 -o SiteRepairATMC_1.X SiteRepairATMC_1.X.o -g -fopenmp -L/home/bkeister/g++/lib -lphys -lconfig++ -L /usr/local/lib -lm

When I run the executable, I get the following:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

I ran cuda-memcheck on the executable and get the following:

========= CUDA-MEMCHECK
========= Invalid global read of size 8
========= at 0x00000d28 in /home/bkeister/g++/app/SiteRepair-GPU/SiteRepairATMC_1.X.cc:511:EulerIntegrationMethod::UpdateArrays_507_gpu(double***, double***)
========= by thread (48,0,0) in block (0,0,0)
========= Address 0x00000000 is out of bounds

I thought this might be a matter of running out of memory, but I ran the code with a dynamic array size of [7][3][7] and still got the error.

There is another function involving two similar arrays for which I don’t get this runtime error.

I’m new to this, not sure what I’ve missed.

I think I see the source of the problem but I don’t know how to address it.

The function in question is a member function of a class that’s invoked by another member function. If I take the called function outside the class definition, the stripped down program runs without an address error, though I don’t know if it’s generating sensible numbers.

Is this possibly a copyin/copyout issue? If I call a similar function between classes I don’t get an error and the program runs correctly.

So the question is: how do I call a function on arrays within a class?

An illegal address error is analogous to a segmentation violation on the host: a bad memory address is being accessed. How are you managing the device data? Are the arrays data members of the class? How are the arrays allocated?

Assuming yes, then the problem may be due to the hidden this pointer. Data members accessed in a class method go through a hidden this pointer, so while you may write "A_T_array", it is really accessed as "this->A_T_array". If you don't have the this pointer on the device, the compiler will implicitly (shallow) copy it to the device, and the pointer members it contains will still hold host addresses, which the device then tries to dereference.

First, I’d add “default(present)” on your kernels directive so the compiler doesn’t try to implicitly copy the this pointer or arrays.

Then in your class constructor, add

#pragma acc enter data create(this)

After you allocate the arrays in the class, then add:

#pragma acc enter data create(AX_array[:nx][:ny][:nz], A_T_array[:nx][:ny][:nz])

replacing nx, ny, and nz with the actual size of each dimension. You may need to split this into two enter data directives if the arrays are not allocated in the same place in the code. Alternatively, you can use "copyin" instead of "create" for A_T_array if the array has already been assigned values. If not, you'll want to use an update directive to copy the assigned values to the device:

#pragma acc update device(A_T_array[:nx][:ny][:nz])
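Putting these pieces together, here is a minimal sketch of how the constructor, destructor, and a host-to-device update might look. The class name, member names (nx/ny/nz, A_T_array), and helper methods are placeholders, not taken from the actual code; compiled without -acc, the pragmas are ignored and everything runs on the host.

```cpp
#include <cassert>

// Hypothetical class managing one 3-D array on both host and device.
class Grid {
  int nx, ny, nz;
  double ***A_T_array;   // illustrative data member
public:
  Grid(int x, int y, int z) : nx(x), ny(y), nz(z) {
    // Put the object itself on the device so this->A_T_array resolves
    // to a device address inside compute regions.
#pragma acc enter data create(this)
    A_T_array = new double**[nx];
    for (int i = 0; i < nx; i++) {
      A_T_array[i] = new double*[ny];
      for (int j = 0; j < ny; j++)
        A_T_array[i][j] = new double[nz]();   // zero-initialized
    }
    // Reserve matching device storage for the pointer chain and data.
#pragma acc enter data create(A_T_array[:nx][:ny][:nz])
  }
  ~Grid() {
#pragma acc exit data delete(A_T_array[:nx][:ny][:nz])
#pragma acc exit data delete(this)
    for (int i = 0; i < nx; i++) {
      for (int j = 0; j < ny; j++) delete[] A_T_array[i][j];
      delete[] A_T_array[i];
    }
    delete[] A_T_array;
  }
  void Set(double v) {
    for (int i = 0; i < nx; i++)
      for (int j = 0; j < ny; j++)
        for (int k = 0; k < nz; k++)
          A_T_array[i][j][k] = v;
    // Push the host values to the device copy.
#pragma acc update device(A_T_array[:nx][:ny][:nz])
  }
  double At(int i, int j, int k) const { return A_T_array[i][j][k]; }
};
```

With this layout, compute regions inside member functions can use default(present), since this and the arrays are already on the device.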

Thanks very much for these comments. The role of 'this' makes sense. The arrays in the class are allocated in the constructor using nested 'new' calls and deallocated in the destructor with nested 'delete'. In the CPU version, calls from outside the class just pass pointers to avoid extensive data transfer, and the arrays are simply dereferenced with brackets: [jx][jy][jz].

Some followup:

First, I need some clarification: the documentation and your reply use the array notation [:nx], which I assume is short for [0:nx]. But the bigger question for me concerns arrays that are allocated at run time (the number of grid points in each dimension is part of the input). I assume I can't put a run-time variable into a compiler directive.

Second, I can try out these comments within the class definition, but it almost seems easier for me to take the function in question outside the class so as to avoid the ’this’ problem. This seems a bit counterintuitive.

Concerning both of these points, I have another instance within the code (that I did not post) involving a call to a member function with arrays in a different class, with no concern about ’this’ and no concern about the run-time array sizes. As best I can tell, OpenACC compiled the code and gave results that are identical to those using only CPU (and possibly OpenMP).

Correct, if the lower bound is omitted, a default value of “0” is used in C/C++ and “1” in Fortran.

I assume I can’t put a run-time variable into a compiler directive.
Actually, you can use variables for the array shape bounds. They do not need to be static or constant values.

I can try out these comments within the class definition, but it almost seems easier for me to take the function in question outside the class so as to avoid the ’this’ problem. This seems a bit counterintuitive.

Whatever works best for you, though whenever you reference a class data member on the device, either implicitly through the hidden pointer or explicitly through a class object, you will still need to have the "this" pointer on the device as well. Of course, if you're passing the data member as an argument, then you're no longer accessing it through the class object.

Trying to implement this, think I got some syntax wrong.

At the beginning of the constructor (before allocating arrays), I inserted:

#pragma acc enter data create(this)

I then got a series of errors as I called the allocator:
333, Generating enter data create(this[:1])
NVC++-S-0000-Internal compiler error. pragma: bad ilmopc 307 (SiteRepairATMC_1.5.cc: 523)

Then for the loop in question, I wrote the following (using copy because the arrays hold data that are updated in the loops):

#pragma acc kernels default(present)
#pragma acc loop independent collapse(3)
#pragma acc data copy(A_T_array[MaxPoints.x][MaxPoints.y][MaxPoints.z],
AX_array[MaxPoints.x][MaxPoints.y][MaxPoints.z],
PX_array[MaxPoints.x][MaxPoints.y][MaxPoints.z],
P_array[MaxPoints.x][MaxPoints.y][MaxPoints.z],
B_array[MaxPoints.x][MaxPoints.y][MaxPoints.z],
X0_T_array[MaxPoints.x][MaxPoints.y][MaxPoints.z],
X1_T_array[MaxPoints.x][MaxPoints.y][MaxPoints.z],
Xm_array[MaxPoints.x][MaxPoints.y][MaxPoints.z])

for (int jx = 0; jx < MaxPoints.x; jx++)
  for (int jy = 0; jy < MaxPoints.y; jy++)
    for (int jz = 0; jz < MaxPoints.z; jz++) {

This seemed to compile OK:

523, Generating default present(X1_T_array[__b_17IntegrationMethod.MaxPoints.x][__b_17IntegrationMethod.MaxPoints.y],P_array[__b_17IntegrationMethod.MaxPoints.x][__b_17IntegrationMethod.MaxPoints.y],PX_array[__b_17IntegrationMethod.MaxPoints.x][__b_17IntegrationMethod.MaxPoints.y],AX_array[__b_17IntegrationMethod.MaxPoints.x][__b_17IntegrationMethod.MaxPoints.y],B_array[__b_17IntegrationMethod.MaxPoints.x][__b_17IntegrationMethod.MaxPoints.y],A_T_array[__b_17IntegrationMethod.MaxPoints.x][__b_17IntegrationMethod.MaxPoints.y],X0_T_array[__b_17IntegrationMethod.MaxPoints.x][__b_17IntegrationMethod.MaxPoints.y],Xm_array[:][:][:],X1_T_array[:][:][:],X0_T_array[:][:][:],B_array[:][:][:],P_array[:][:][:],PX_array[:][:][:],AX_array[:][:][:],A_T_array[:][:][:],this[:],Xm_array[__b_17IntegrationMethod.MaxPoints.x][__b_17IntegrationMethod.MaxPoints.y])
536, Loop is parallelizable
537, Loop is parallelizable
538, Loop is parallelizable
     Generating Tesla code
    536, #pragma acc loop gang, vector(128) collapse(3) /* blockIdx.x threadIdx.x */
    537,   /* blockIdx.x threadIdx.x collapsed */
    538,   /* blockIdx.x threadIdx.x collapsed */

Could be something simple, but I’m not familiar with OpenACC syntax.

Found an obvious mistake: I forgot the colons in the array ranges. I also fixed some other statements to match yours.

The code now runs, but at least some of the arrays have all zeros.

As I understand it,

#pragma acc enter data reserves space on the device
#pragma acc update device() copies the current array values onto the device

I tried adding

#pragma acc update host()

at the end of the routine, but I’m still getting zeros, in contrast to the working CPU version.

So something is not being copied correctly to the device or back out. Put another way: a function call outside the class with similar arrays is handled by OpenACC with minimal #pragma directives, and I need to reproduce those implicit actions.

I think I’ll need a minimal reproducing example to better understand what’s going on.

OK - Here is a self contained test file:

ACCtest.cpp (1.1 KB)

Here is the output after compiling with g++ (no ACC option):

[0][0] element: 1
[0][0] element: 2
[0][0] element: 3
[0][0] element: 4
[0][0] element: 5
[0][0] element: 6
[0][0] element: 7
[0][0] element: 8
[0][0] element: 9
[0][0] element: 10

…and here is the output after compiling with nvc++ -acc -Minfo=accel ACCtest.cc

[0][0] element: 0
[0][0] element: 0
[0][0] element: 0
[0][0] element: 0
[0][0] element: 0
[0][0] element: 0
[0][0] element: 0
[0][0] element: 0
[0][0] element: 0
[0][0] element: 0

Accelerator Kernel Timing data
/home/bkeister/g++/app/SiteRepair-GPU/ACCtest.cc
_ZN9TestClassC1Ev NVIDIA devicenum=0
time(us): 17
16: data region reached 1 time
23: data region reached 1 time
23: data copyin transfers: 1
device time(us): total=17 max=17 min=17 avg=17
35: kernel launched 10 times
grid: [1] block: [128]
elapsed time(us): total=168 max=28 min=15 avg=16
/home/bkeister/g++/app/SiteRepair-GPU/ACCtest.cc
_ZN9TestClass15ManipulateArrayEv NVIDIA devicenum=0
time(us): 0
44: compute region reached 10 times
48: kernel launched 10 times
grid: [1024] block: [128]
elapsed time(us): total=1,252 max=138 min=123 avg=125
44: data region reached 20 times

A more general question: I’m confused about the concept of Unified Memory. I’m running a video card with Pascal architecture (not very fast, but I can check out the code). Earlier I tried running OpenACC on a Jetson Nano and got a message that it was not supported, I presume partly because Maxwell architecture doesn’t have Unified Memory. But if memory is unified, why must there be directives to move data back and forth between CPU and GPU? OpenACC “supports” C++, but handling data inside a class seems very awkward, especially since the concept of unified memory suggests that data transfer would not be needed inside a class.

My ultimate goal is to minimize transfers of large arrays back and forth between CPU and GPU, because that will kill any performance benefit of a GPU compared to a multi-threaded CPU running OpenMP. But I can’t tell which directives trigger a data transfer, which ones might be ignored if data transfer is not actually needed, and how much of this depends upon the architecture.

Hi xtz465,

Two problems: data regions can't be inside a compute region, and you forgot to copy the data back from the device.

--- ACCtest.org.cpp     2020-07-16 08:01:00.801492000 -0700
+++ ACCtest.cpp 2020-07-16 08:03:31.725371000 -0700
@@ -40,14 +40,15 @@

   void ManipulateArray() {

+#pragma acc update device(Array[:nx][:ny])
 #pragma acc kernels default(present)
     {
 #pragma acc loop independent collapse(2)
-#pragma acc update device(Array[:nx][:ny])
   for (int jx = 0; jx < nx; jx++)
       for (int jy = 0; jy < ny; jy++)
        Array[jx][jy] += 1;
     }
+#pragma acc update self(Array[:nx][:ny])
   }
 };
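For reference, here is a self-contained sketch of a corrected test class along the lines of the diff above. Since the original ACCtest.cpp is not reproduced in this thread, the class layout, member names, and grid size are my guesses, not the actual file. Built without -acc, the pragmas are ignored and the element still increments once per call, matching the g++ output shown earlier.

```cpp
#include <cassert>

// Hypothetical reconstruction of the corrected test class.
// (Destructor and deallocation omitted for brevity.)
class TestClass {
  int nx, ny;
  double **Array;
public:
  TestClass(int x = 128, int y = 128) : nx(x), ny(y) {
#pragma acc enter data create(this)
    Array = new double*[nx];
    for (int i = 0; i < nx; i++)
      Array[i] = new double[ny]();          // zero-initialized
#pragma acc enter data copyin(Array[:nx][:ny])
  }

  void ManipulateArray() {
    // Push current host values to the device before the kernel...
#pragma acc update device(Array[:nx][:ny])
#pragma acc kernels default(present)
    {
#pragma acc loop independent collapse(2)
      for (int jx = 0; jx < nx; jx++)
        for (int jy = 0; jy < ny; jy++)
          Array[jx][jy] += 1;
    }
    // ...and copy the results back so the host sees them.
#pragma acc update self(Array[:nx][:ny])
  }

  double First() const { return Array[0][0]; }
};
```

After ten calls to ManipulateArray(), First() should return 10, whether the loops ran on the host or on the GPU.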

OpenACC "supports" C++, but handling data inside a class seems very awkward, especially since the concept of unified memory suggests that data transfer would not be needed inside a class.

Due to encapsulation, I would think you'd want the class to manage its own data, especially if any of the data members are private. I'd suggest adding methods to the class to handle the data movement between host and device so you have more control over when the data is copied. You don't need to follow this, but one of the examples I wrote for my chapter in the book Parallel Programming with OpenACC shows a basic container class that manages OpenACC data: https://github.com/rmfarber/ParallelProgrammingWithOpenACC/blob/master/Chapter05/accList.h

I tried running OpenACC on a Jetson Nano and got a message that it was not supported, I presume partly because Maxwell architecture doesn’t have Unified Memory.

Possible, but what was the exact message? Maybe you just need to update your CUDA driver to >= 450.36, which is the minimum required driver version. See: https://docs.nvidia.com/hpc-sdk/hpc-sdk-release-notes/index.html#platform-requirements

My ultimate goal is to minimize transfers of large arrays back and forth between CPU and GPU,

That’s a good goal. Ideally you’d offload all the compute to the GPU so the data only needs to be moved once to the device at the beginning of the program, and the results are copied back once at the end. Though, most programs will have at least some intermediary data movement.

But I can’t tell which directives trigger a data transfer, which ones might be ignored if data transfer is not actually needed, and how much of this depends upon the architecture

Data regions are optional depending on the target architecture. For example, when targeting a multicore CPU, data regions are ignored entirely, and when targeting a GPU with CUDA Unified Memory, data movement is ignored for dynamically allocated data.

If your code's data structures are very complex and/or use container classes such as std::vector, it is often easier to stick with Unified Memory. Otherwise, I suggest using explicit data regions to manage the device data. A data transfer occurs at the beginning or end of a data region, depending on which clause is used and whether the data is already present on the device. Data can also be transferred within the scope of a data region via an "update" directive; the data must be present on the device in order to be used in an "update" directive.
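To illustrate where the transfers happen, here is a small made-up example (the function and variable names are not from the code in this thread): with copy, the host-to-device transfer occurs at the region's opening brace and the device-to-host transfer at its closing brace, while an update directive would force a copy in between.

```cpp
#include <cassert>

// Structured data region: copy(a[:n]) transfers host->device at entry
// and device->host at exit.  Compiled without -acc, the pragmas are
// ignored and the loop simply runs on the host.
static void scale(double *a, int n, double s) {
#pragma acc data copy(a[:n])
  {
#pragma acc parallel loop present(a[:n])
    for (int i = 0; i < n; i++)
      a[i] *= s;
    // A mid-region transfer could be forced here with, e.g.:
    // #pragma acc update self(a[:n])
  }
}
```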

Mat,

Thanks for all your helpful comments. At this point I will tinker with the toy model to see what works best for me in terms of data management (aka minimizing copies back and forth).

The error message I get on the Jetson Nano is at run time:

Failing in Thread:1
call to cuModuleLoadDataEx returned error 209: No binary for GPU

Also my understanding is that the Nano has some sort of builtin driver that comes with the basic software package; there is nothing to download.

Your comment about driver 450.36 also prompted another question: do I need to use this instead of a 440 vintage? That’s what is running on my AMD box with GT 710 graphics card.

Brad

Hi Brad,

Your comment about driver 450.36 also prompted another question: do I need to use this instead of a 440 vintage? That’s what is running on my AMD box with GT 710 graphics card.

No, if you look at the chart from the link to the release notes, you’ll see that the CUDA 11 driver (450.36) is only needed for ARM architectures. For x86, the minimum driver version is 418.39.

 Failing in Thread:1
call to cuModuleLoadDataEx returned error 209: No binary for GPU

Also my understanding is that the Nano has some sort of builtin driver that comes with the basic software package; there is nothing to download.

Ok, so I don't have a Nano here, nor do we officially support the Nano with the HPC compilers, but you might be able to get this to work by setting the environment variable "CUDA_HOME" to the system's CUDA installation that matches the driver version, and then adding the flag "-gpu=cc50" to the compile line so the compiler targets a Maxwell device.

-Mat

I managed to get my full code to compile and run using -ta=tesla:managed. However, I got the following message at runtime:

pool allocator: Specified pool size too big for this device

I was unable to look this up. The executable took a very long time to run. My low-end GT 710 card has 1 GB memory. Is that the problem?

Calling cudaMallocManaged (the CUDA API that is used to allocate unified memory) does incur a high overhead cost. This is particularly true for C++ codes which often have many small allocations.

To help with this, we create a unified memory pool so memory can be reused and can significantly reduce the overhead costs. Details can be found at: https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html#acc-mem-unified

The default pool size is 1GB, which is why you're getting this warning. The code will still run, but it falls back to calling cudaMallocManaged directly instead of using the pool. I don't know for sure without profiling your code, but this could be the cause of the poor performance.

You might try setting the pool size to 500MB by setting the environment variable “PGI_ACC_POOL_SIZE=500MB” (or NVCOMPILER_ACC_POOL_SIZE for the nvc++ compiler), to see if this helps.
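For example, from the shell before launching the executable (the 500MB value is just the suggestion above; pick something below your card's 1 GB):

```shell
# Shrink the unified-memory pool so it fits on a 1 GB card.
export NVCOMPILER_ACC_POOL_SIZE=500MB
echo "$NVCOMPILER_ACC_POOL_SIZE"
```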