Is there any limit on the size of a firstprivate array?

// code 1
// pgc++ 19.4 output
//      43, Generating Tesla code
//         53, #pragma acc loop gang /* blockIdx.x */
#pragma acc parallel loop independent firstprivate(array[0:n])
for (int i = 0; i < n; ++i) {
  // other code
}
// float* array was defined somewhere else


// code 2
// pgc++ 19.4 output
//     43, Generating Tesla code
//         53, #pragma acc loop gang(2) /* blockIdx.x */
#pragma acc parallel loop gang num_gangs(2) firstprivate(array[0:n])
for (int i = 0; i < n; ++i) {
  // other code
}
// float* array was defined somewhere else

Hi, I have two code snippets here, both of which work fine when n (unknown at compile time) is relatively small. But when n is large, e.g., ~24000, code 1 fails with a “call to cuStreamSynchronize returned error 700: Illegal address” error.

I understand that many issues can cause such a problem, so I also experimented with num_gangs(#) in code 2: it fails when # is between 22000 and 23000 (~2 GB). This made me wonder whether there is a memory limit on firstprivate arrays.

Additionally, what is the best practice for determining the number of gangs, as shown in code 2?

Thanks,

Hi stw,

Try compiling with “-Mlarge_arrays”. This will allow for arrays (including firstprivate arrays) larger than 2GB.

Additionally, what is the best practice for determining the number of gangs, as shown in code 2?

In general, it’s best practice not to set the number of gangs and to let the runtime decide based on the loop trip count (“n” in this case). Limiting the number of gangs limits the amount of parallelism available and has a negative impact on performance.

One exception would be for this case where you have a large private array and need to adjust for the amount of memory available on the device. The main problem with this is that the maximum number of gangs will be variable based on the value of “n” and the amount of available memory on the device. You may consider making the number of gangs a computed variable so it’s the maximum for each problem size and device.

Note that you can obtain the amount of free memory on the device by calling:

size_t availmem = acc_get_property(devNum, devType, acc_property_free_memory);
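
To make the "computed variable" idea a bit more concrete, here is a minimal sketch rather than a tested recipe. The helper name max_gangs_for_private_array and the 50% memory budget are assumptions made up for illustration; free_bytes is meant to be the value returned by the acc_get_property call above.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper, not part of OpenACC or the PGI runtime.
//   free_bytes     : free device memory in bytes (e.g. from acc_get_property)
//   n              : number of elements in each private copy of the array
//   elem_size      : size of one element in bytes
//   budget_percent : share of free memory to spend on private copies
//                    (50 is an arbitrary safety margin, an assumption)
constexpr long max_gangs_for_private_array(std::size_t free_bytes, long n,
                                           std::size_t elem_size,
                                           unsigned budget_percent = 50) {
  if (n <= 0 || elem_size == 0) return 1;
  // Each gang gets its own copy of the array, so it costs n * elem_size bytes.
  const std::size_t bytes_per_gang = static_cast<std::size_t>(n) * elem_size;
  const std::size_t budget = free_bytes / 100 * budget_percent;
  const std::size_t fit = budget / bytes_per_gang;
  // Never ask for more gangs than loop iterations, and never fewer than one.
  const std::size_t capped = std::min(fit, static_cast<std::size_t>(n));
  return capped == 0 ? 1 : static_cast<long>(capped);
}
```

The result could then be passed to num_gangs(...) on the parallel loop in code 2. The budget is only a guess, since other device allocations need room as well.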

Without more details about your code, it’s difficult to offer specific advice. Though, if you can refactor your algorithm so it doesn’t need the large private array, you’ll be better off. Besides the memory constraints and limited parallelism, you’re using “firstprivate”, so you will have the overhead of initializing each private copy of the array.

Sans that, do you have an inner vector loop? If so, you may be able to get some of the performance back by increasing the vector length to 1024 or 2048 (depending on the trip count of the inner loop). By default the vector length is 128.
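
As a rough illustration of what I mean by an inner vector loop with a larger vector length, here is a minimal sketch. The function scale_rows and its arguments are made up for the example, and 1024 is just one possible value; whether it helps depends on the trip count of the inner loop.

```cpp
// Hypothetical example: doubles every element of a rows x cols matrix.
// The outer loop is spread across gangs and the inner loop across vector
// lanes, with the vector length raised from the default via vector_length.
void scale_rows(int rows, int cols, const float* in, float* out) {
  #pragma acc parallel loop gang vector_length(1024) \
          copyin(in[0:rows*cols]) copyout(out[0:rows*cols])
  for (int r = 0; r < rows; ++r) {
    #pragma acc loop vector
    for (int c = 0; c < cols; ++c)
      out[r * cols + c] = 2.0f * in[r * cols + c];
  }
}
```

Compiled without OpenACC support, the pragmas are ignored and the loops simply run serially, which makes it easy to check the result on the host first.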

-Mat

Thanks Mat.

I wrote the following test only to see if I used ‘-Mlarge_arrays’ correctly with firstprivate, so please forget about how stupid the code looks. I don’t think I completely get it. Your help is much appreciated.

P.S. It seems I cannot use the .less. key in the post, or it causes a 403 Forbidden error.

$ pgc++ --version

pgc++ 19.4-0 LLVM 64-bit target on x86-64 Linux -tp penryn
PGI Compilers and Tools
Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.

$ pgc++ -acc -fast -ta=tesla -Minfo=accel -Mlarge_arrays -std=c++11 -o a.out main.cc

#include <cstdio>
#include <string>
#include <vector>

// will return n, ptr is a work array
int func(int n, float* ptr) {
  std::vector<float> vec(n, 1);
  float* array = vec.data();

  #pragma acc parallel loop gang num_gangs(n)\
          firstprivate(array[0:n]) copy(ptr[0:n])
  for (int i = 0; i < n; ++i)
    ptr[i] = array[i];

  int ret = 0;
  for (int i = 0; i < n; ++i)
    ret += ptr[i];

  return ret;
}

int main(int argc, char** argv) {
  int n = std::stoi(argv[1]);
  std::vector<float> vec(n);

  int ans = func(n, vec.data());
  printf("func(%d,float*)=%d\n", n, ans);
  return 0;
}

// output

$ ./a.out 23100
func(23100,float*)=23100

$ ./a.out 23200
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Failing in Thread:1
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Thanks for the example. Looks like the problem here is with “n”. If I change it to be a “long” instead of an “int”, the code compiles and runs even without -Mlarge_arrays.

% cat test.cpp
#include <cstdlib>
#include <vector>
#include <stdio.h>
#include <string>

// will return n, ptr is a work array
int func(long n, float* ptr) {
  std::vector<float> vec(n, 1);
  float* array = vec.data();

  #pragma acc parallel loop gang num_gangs(n)\
          firstprivate(array[0:n]) copy(ptr[0:n])
  for (int i = 0; i < n; ++i)
    ptr[i] = array[i];

  int ret = 0;
  for (int i = 0; i < n; ++i)
    ret += ptr[i];

  return ret;
}

int main(int argc, char** argv) {
  int n = std::stoi(argv[1]);
  std::vector<float> vec(n);

  int ans = func(n, vec.data());
  printf("func(%d,float*)=%d\n", n, ans);
  return 0;
}

// output

% pgc++ -ta=tesla test.cpp -Minfo=accel -V19.4
func(long, float *):
      9, Generating copy(ptr[:n])
         Generating Tesla code
         13, #pragma acc loop gang(n), vector(128) /* blockIdx.x threadIdx.x */
      9, Generating update device(array[:n])
% a.out 23200
func(23200,float*)=0
% a.out 50000
func(50000,float*)=0



P.S. It seems I cannot use the .less. key in the post, or it causes a 403 Forbidden error.

Odd. I had another user see something similar, but we weren’t able to determine the cause. For whatever reason, I can post the same code without issue. Did you put the code in a code block?

Our Webmaster is on vacation this week, but I’ll ask him about it when he gets back.

-Mat

It seems that in the underlying implementation, the int type was carried over into the memory allocation, and the calculated size exceeded 0x7FFFFFFF (sizeof(float) * 23200**2 = 0x80539000).
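
The arithmetic can be checked with a small sketch. The helper names private_bytes and fits_in_int32 are mine, and the sketch assumes the runtime sizes the private storage as num_gangs * n * sizeof(float) with num_gangs equal to n, which is my guess from the failure threshold rather than something confirmed.

```cpp
#include <cstdint>
#include <limits>

// Total bytes needed when each of n gangs gets a private copy of n floats,
// computed in 64-bit arithmetic so the product cannot wrap around.
constexpr std::int64_t private_bytes(std::int64_t n) {
  return static_cast<std::int64_t>(sizeof(float)) * n * n;
}

// True if the byte count is still representable as a signed 32-bit int.
constexpr bool fits_in_int32(std::int64_t bytes) {
  return bytes <= std::numeric_limits<std::int32_t>::max();
}

// n = 23100 -> 2134440000 bytes, below 0x7FFFFFFF (2147483647)
// n = 23200 -> 2152960000 bytes (0x80539000), above 0x7FFFFFFF
```

This matches what I saw: 23100 ran and 23200 failed when n was an int.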

Did you put the code in a code block?

Yes. In fact, it triggers a 403 Forbidden error regardless of where I paste these special characters (.less., caret, etc.) into the post, whether from my text editor or my terminal. Typing these characters directly into the post was fine (<>^). Obviously, it wasn’t a problem when I wrote the first post of this thread.