Error 13 from cudaMemcpyToSymbol

dscerutti · September 26, 2020, 8:35pm

I feel bad asking this. I know from the API pages that a cudaError_t value of 13 “indicates that the symbol name/identifier passed to the API call is not a valid name or identifier.” But I am mirroring the steps I took in another program to do something very similar.

The data types:

struct GpuRismManagement {
  int istep;                 // Number of steps since start/restart
  int np;                    // Number of data points (first dimension length)
  int nvec;                  // Maximum number of vectors
  double restart;            // Ratio of the current residual to the minimum residual found
                             //   that causes a restart
  double delta;              // Coefficient for residual gradient, between 0 and 1.
  double tol;                // Target residual
  float* overlap;            // The overlap matrix
  float* ri;                 // An array of vector data, np by nvec
  float* xi;                 // An array of residual data, np by nvec
  int* vecMap;               // Maps vectors order onto ri and xi.  E.g. vecMap(0) gives the
                             //   index of xi or ri that is most recently updated
};
typedef struct GpuRismManagement gpuRismInfo;

struct HostRismManagement {
  int istep;                 // Number of steps since start/restart
  int np;                    // Number of data points (first dimension length)
  int nvec;                  // Maximum number of vectors
  double restart;            // Ratio of the current residual to the minimum residual found
                             //   that causes a restart
  double delta;              // Coefficient for residual gradient, between 0 and 1.
  double tol;                // Target residual
  gpuFloat overlap;          // The overlap matrix
  gpuFloat ri;               // An array of vector data, np by nvec
  gpuFloat xi;               // An array of residual data, np by nvec
  gpuInt vecMap;             // Maps vector order onto ri and xi.  E.g. vecMap(0) gives the
                             //   index of xi or ri that is most recently updated
  gpuRismInfo gpuConst;      // See type definition above.  This holds constants (including
                             //   pointers) that the GPU will need in order to carry out work.
};
typedef struct HostRismManagement hostRismInfo;

gpuFloat and gpuInt are data types I have defined that track partnered arrays of floats and ints on the host and on the device, with associated functions for downloading and uploading the data. Note that a gpuRismInfo lives inside a hostRismInfo, and the gpuRismInfo contains pointers to the device-side arrays of the gpuFloat and gpuInt members of that same hostRismInfo.

1.) In one code file, let’s call it MyCudaUnit.cu, I declare up at the top “__device__ __constant__ gpuRismInfo cRISM;” and the function:

//---------------------------------------------------------------------------------------------
// SetRismImage: function to establish a GPU RISM data structure on the device, with pointers
//               to all of the device-allocated memory as well as constants describing the
//               simulation conditions.
//
// Arguments:
//   hRISM:     the repository for all parameters and data on the host
//---------------------------------------------------------------------------------------------
extern "C" void SetRismImage(hostRismInfo *hRISM)
{
  cudaError_t status;

  status = cudaMemcpyToSymbol((void**)&cRISM, &hRISM->gpuConst, sizeof(gpuRismInfo));
  if (status != cudaSuccess) {
    printf("SetRismImage >> Unable to copy gpuRismInfo struct to the "
           "device (error %d).\n", (int)status);
    exit(1);
  }
}

2.) In another code file, let’s call it MyCLayerThatCallsCuda.c, I declare “static hostRismInfo hRISM;” up at the top, along with a function that calls “SetRismImage(&hRISM);”

3.) In a third code file, let’s call it MyFortranLayerWhichCallsC.F90, I call the function in MyCLayerThatCallsCuda.c that in turn calls SetRismImage(&hRISM).

But when I run this, I get that dreaded error 13. Anyone have ideas? I also noticed that in other codes my team and I have omitted the (void**)& in front of cRISM, such that the device constant is passed as a value, not a void pointer, as the first argument of cudaMemcpyToSymbol. If I do that, which again works in other codes, I now get cudaError_t 77, and I don’t know what that means from reading the API documentation.

Dave

Robert_Crovella · October 16, 2020, 1:18am

You definitely should not ever cast the address of the symbol, as in (void**)&cRISM. It cannot ever be correct syntax. You don’t take the address of a device symbol. When passing a device symbol to cudaMemcpyToSymbol(), just use the symbol name directly:

cudaMemcpyToSymbol(cRISM, ...
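
To make the point concrete, here is a minimal sketch of a symbol copy that passes the symbol name directly and then reads the data back through a kernel. DemoInfo, cDemo, ReadBack and UploadAndReadBack are hypothetical names invented for illustration; DemoInfo is only a two-field stand-in for gpuRismInfo, not the poster's actual struct.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for gpuRismInfo, reduced to two fields.
struct DemoInfo {
  int np;
  double tol;
};

// File-scope constant-memory symbol (the counterpart of cRISM).
__constant__ DemoInfo cDemo;

// Kernel that copies the constant-memory struct to global memory for checking.
__global__ void ReadBack(DemoInfo *out)
{
  *out = cDemo;
}

// Upload hDemo to the symbol, then read it back through a kernel.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t UploadAndReadBack(const DemoInfo *hDemo, DemoInfo *hOut)
{
  // Pass the symbol itself: no cast and no address-of operator.
  cudaError_t status = cudaMemcpyToSymbol(cDemo, hDemo, sizeof(DemoInfo));
  if (status != cudaSuccess) return status;

  DemoInfo *dOut = NULL;
  status = cudaMalloc((void**)&dOut, sizeof(DemoInfo));
  if (status != cudaSuccess) return status;

  ReadBack<<<1, 1>>>(dOut);
  // This blocking copy also waits for the kernel to finish.
  status = cudaMemcpy(hOut, dOut, sizeof(DemoInfo), cudaMemcpyDeviceToHost);
  cudaFree(dOut);
  return status;
}
```

Calling UploadAndReadBack from host code with a filled-in DemoInfo should hand back the same field values, which is a quick way to confirm the symbol copy itself is working before looking elsewhere for the fault.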

Depending on the CUDA version you are using, CUDA (runtime) error 77 refers to an illegal address error. This always originates from device code, so it means that a kernel launched prior to the cudaMemcpyToSymbol operation hit an illegal address. Run your code with cuda-memcheck to find the kernel responsible.

You might struggle less with deciphering errors if you use the convenient, available, handy, built-in error to string converter provided by CUDA:

printf("SetRismImage >> Unable to copy gpuRismInfo struct to the "
       "device (error %s).\n", cudaGetErrorString(status));
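
Since an illegal address raised inside a kernel is only reported by a later API call, it may also help to check right after each launch. The following is one possible sketch, not a prescribed pattern: ReportCudaError, FillOnes and LaunchAndCheck are hypothetical names made up for this example, while cudaGetErrorName, cudaGetErrorString, cudaGetLastError and cudaDeviceSynchronize are the standard runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed CUDA call by name and description.
// Returns 1 on failure and 0 on success so the caller decides how to react.
static int ReportCudaError(cudaError_t status, const char *what)
{
  if (status == cudaSuccess) return 0;
  printf("%s >> %s: %s (error %d).\n", what, cudaGetErrorName(status),
         cudaGetErrorString(status), (int)status);
  return 1;
}

// Trivial kernel used only to illustrate where the checks go.
__global__ void FillOnes(float *x, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = 1.0f;
}

// Launch the kernel and check both the launch and the execution.
// Returns 0 if every step succeeded, nonzero otherwise.
int LaunchAndCheck(int n)
{
  float *dX = NULL;
  if (ReportCudaError(cudaMalloc((void**)&dX, n * sizeof(float)), "cudaMalloc")) {
    return 1;
  }

  FillOnes<<<(n + 255) / 256, 256>>>(dX, n);
  // Launch-configuration errors show up here.
  int failed = ReportCudaError(cudaGetLastError(), "FillOnes launch");
  // Errors raised while the kernel runs (e.g. an illegal address) show up here.
  failed |= ReportCudaError(cudaDeviceSynchronize(), "FillOnes execution");

  cudaFree(dX);
  return failed;
}
```

The synchronize after every launch costs performance, so this style of check is probably best kept to debug builds; the benefit is that a faulty kernel is reported at its own launch rather than at an unrelated call such as cudaMemcpyToSymbol.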