call to cuMemcpyDtoHAsync returned error 1: Invalid value

Hello. I tried to run an OpenACC example and got the following error:
call to cuMemcpyDtoHAsync returned error 1: Invalid value

This is my code:

#include <iostream>
using namespace std;
#include <cstdlib>
#include <cassert>
#include <chrono>
using namespace std::chrono;

// a sequential task 
// a task that runs in a thread
#pragma acc routine worker
inline char task(int index, int nLoop)
{
  long counter=0;
  for(long i=0; i < 1000000*nLoop; i++)
      counter += (i>=0)?1:0;

  // return 1 if the counter is correct
  return( ((counter/1000000) == nLoop)?1:0 );
}

int main(int argc, char *argv[])
{
  if(argc < 3) {
    cerr << "Use: nCount nLoop" << endl;
    return -1;
  }
  
  int nCount = atoi(argv[1]);
  int nLoop = atoi(argv[2]);
  
  if(nCount < 0 || nLoop < 0) {
    cerr << "ERROR: both nCount and nLoop must be non-negative!" << endl;
    return -1;
  }
  
  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  
  // Here is where we evaluate the task(s)
  int sum=0;
#pragma acc parallel loop reduction(+:sum)
  for(int i=0; i < nCount; i++)
    sum += task(i,nLoop);
  
  high_resolution_clock::time_point t2 = high_resolution_clock::now();
  duration<double> time_span = duration_cast< duration<double> >(t2 - t1);
  
  cout << "Duration " << time_span.count() << " second" << endl;
  cout << "Final sum is " << sum << endl;
  
  // Sanity check: every task should have returned 1
  assert(sum == nCount);
  
  return 0; // normal exit
}

Also, this is my PGI version:

pgc++ 18.10-1 64-bit target on x86-64 Linux -tp haswell.

What could be the problem?

Hi kspan,

For good or bad, I’m not able to reproduce the error. The code compiles and runs fine for me. The only failure I see is when I give too large a value for “nLoop” and “counter” overflows.
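As a side note on the overflow: in the posted task(), the loop bound 1000000*nLoop multiplies two int values, so with a 32-bit int the product no longer fits once nLoop goes above 2147, even though "counter" itself is a long. Whether that is the exact overflow seen here is an assumption on my part. A minimal sketch of computing the bound in 64-bit arithmetic instead (the names kItersPerLoop, max_nloop_for_int and loop_bound are hypothetical, not from the original code):

```cpp
#include <cassert>
#include <limits>

// Hypothetical constant: iterations per unit of nLoop, as in the posted task().
constexpr int kItersPerLoop = 1000000;

// Largest nLoop for which kItersPerLoop * nLoop still fits in an int.
// With a 32-bit int this evaluates to 2147.
constexpr int max_nloop_for_int()
{
  return std::numeric_limits<int>::max() / kItersPerLoop;
}

// Loop bound computed in 64-bit arithmetic: the constant is widened
// before the multiply, so the product cannot overflow for any int nLoop.
inline long long loop_bound(int nLoop)
{
  assert(nLoop >= 0);
  return static_cast<long long>(kItersPerLoop) * nLoop;
}
```

For example, loop_bound(4096) gives 4096000000, which would not fit in a 32-bit int. The run shown below uses nLoop = 1024, which stays under the limit either way.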

What device do you have? (I’m using a V100)
What compiler flags are you using?
What input values are you giving the program?

% pgc++ -V

pgc++ 18.10-1 64-bit target on x86-64 Linux -tp skylake
PGI Compilers and Tools
Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.

% pgc++ test1.cpp --c++11 -ta=tesla -Minfo=accel -fast
task(int, int):
     12, Generating Tesla code
         14, #pragma acc loop worker, vector /* threadIdx.y threadIdx.x */
     14, Loop is parallelizable
main:
     39, Accelerator kernel generated
         Generating Tesla code
         14, #pragma acc loop vector(128) /* threadIdx.x */
         15, Generating implicit reduction(+:..inline)
         39, Generating reduction(+:sum)
         41, #pragma acc loop gang /* blockIdx.x */
     39, Generating implicit copy(sum)
          14, Loop is parallelizable
% setenv PGI_ACC_TIME 1
% a.out 10000 1024
Duration 10.523 second
Final sum is 10000

Accelerator Kernel Timing data
/local/home/colgrove/test1.cpp
  main  NVIDIA  devicenum=0
    time(us): 9,690,309
    39: compute region reached 1 time
        39: kernel launched 1 time
            grid: [10000]  block: [128]
             device time(us): total=9,690,244 max=9,690,244 min=9,690,244 avg=9,690,244
            elapsed time(us): total=9,690,327 max=9,690,327 min=9,690,327 avg=9,690,327
        39: reduction kernel launched 1 time
            grid: [2]  block: [256]
             device time(us): total=15 max=15 min=15 avg=15
            elapsed time(us): total=100 max=100 min=100 avg=100
    39: data region reached 2 times
        39: data copyin transfers: 1
             device time(us): total=11 max=11 min=11 avg=11
        42: data copyout transfers: 1
             device time(us): total=39 max=39 min=39 avg=39



kspan wrote:
call to cuMemcpyDtoHAsync returned error 1: Invalid value
What could be the problem?

When I see this error, it is typically due to a bad pointer being used in device code, or some similar error occurring in the kernel. Since you’re not using pointers in the device code, it’s most likely a problem with the reduction array for sum, but since I can’t reproduce the error, I can’t say for sure.
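One variant that might be worth trying, purely as a diagnostic and not as a confirmed fix since the error was not reproduced: the -Minfo output above shows the compiler parallelizing the loop inside task() and generating an implicit reduction for it. Assuming task() is really meant to run sequentially within each iteration, it could be declared "routine seq" so that only the outer loop's reduction on sum remains. The run_tasks wrapper and the explicit copy(sum) clause are additions for illustration, not part of the original program:

```cpp
#include <cassert>

// Assumption: task() is intended to be sequential, so it is declared
// "routine seq" instead of "routine worker".
#pragma acc routine seq
inline char task(int index, int nLoop)
{
  (void)index; // unused, kept to match the original signature

  // Compute the bound in 64-bit arithmetic to avoid int overflow.
  const long long limit = 1000000LL * nLoop;
  long long counter = 0;
  for (long long i = 0; i < limit; i++)
    counter += (i >= 0) ? 1 : 0;

  // return 1 if the counter is correct
  return ((counter / 1000000) == nLoop) ? 1 : 0;
}

// Hypothetical wrapper around the original parallel loop.
inline int run_tasks(int nCount, int nLoop)
{
  int sum = 0;
  // copy(sum) makes the data movement for the reduction variable explicit.
#pragma acc parallel loop reduction(+:sum) copy(sum)
  for (int i = 0; i < nCount; i++)
    sum += task(i, nLoop);
  return sum;
}
```

In main(), the original loop would then become "int sum = run_tasks(nCount, nLoop);" followed by the same assert(sum == nCount). Without OpenACC enabled the pragmas are ignored and the code runs serially on the host, which gives a quick way to check the logic separately from the device.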

-Mat