Copy 2-dimensional structure to device

Hi

I'm trying to copy a two-dimensional struct to the device. It compiles without errors, but gives me an illegal memory access when I access the struct at the second cudaEventSynchronize below. On the host I use it to read in a file, which works perfectly. When I use the one-dimensional version everything works fine. Any pointers would be greatly appreciated.

typedef struct nConnType
{
  double *nCondu;
  int nCount;
} nConnType;

nConnType *nConns[2];
nConnType *dev_nConns[2];
double *dev_nCo[2];

for (int i = 0; i < 2; i++)
   nConns[i] = (nConnType *) malloc(*size/2 * sizeof(nConnType));

getConns(nConns);

// copy values to the device in array dev_nCo and link them into the struct
for(int i = 0; i < 2; i++){
  CUDA_CHECK_RETURN(cudaMalloc( (void**) &dev_nConns[i], *size/2 * sizeof(nConnType)));
  for(int j = 0; j < *size/2; j++){
    CUDA_CHECK_RETURN(cudaMalloc( (void **) &dev_nCo[i], nConns[i][j].nCount * sizeof(double) ) );
    CUDA_CHECK_RETURN(cudaMemcpy( dev_nCo[i], nConns[i][j].nCondu, nConns[i][j].nCount * sizeof(double), cudaMemcpyHostToDevice ) );
    CUDA_CHECK_RETURN(cudaMemcpy( &dev_nConns[i][j].nCondu, &dev_nCo[i], sizeof(double *), cudaMemcpyHostToDevice));
    CUDA_CHECK_RETURN(cudaMemcpy( &dev_nConns[i][j].nCount, &nConns[i][j].nCount, sizeof(int), cudaMemcpyHostToDevice ) );
  }
}

CUDA_CHECK_RETURN(cudaEventSynchronize(cDone[0]));
compute_kernel <<< gridDim, blockDim, 0, stream[0] >>>(dev_nConns);
CUDA_CHECK_RETURN(cudaEventSynchronize(cDone[0]));

__global__ void compute_kernel(nConnType * __restrict__ dev_nConns[2]) 
{
   // j, k and IO_NETWORK_DIM1_D come from code that is not shown here
   int threadNum = (j*IO_NETWORK_DIM1_D + k)/2;

   for (int i = 0; i < dev_nConns[0][threadNum].nCount; i++)
      printf("nib%d con%f \n", i, dev_nConns[0][threadNum].nCondu[i]);
}

...

If I understood the allocation loop correctly, you are doing a nested allocation: a dynamic 1D array of dynamic 1D arrays. The simpler approach is a flattened 1D array.
Nested allocation is messy in C/C++, and in CUDA too. The other problem I see is the element-by-element cudaMemcpy, which gets more inefficient the more iterations you run.

A few solutions would be:

  • If you really want to do the host-device-host copy yourself, switch from 2D array notation to a flattened 1D notation; you only need a small formula to index into the rows. Then do the host-device copy with a single cudaMemcpy covering the entire dataset, something like (see the sketch after this list):
cudaMemcpy(d_object, h_object, N * sizeof(nConnType), cudaMemcpyHostToDevice);

Of course this is just a concept and not actual working code, so you would have to adjust it.

  • Use cudaMallocManaged memory, where the driver is responsible for migrating the data to the host when you reference it from the host, and to the device when you reference it from the device.
  • If you are a regular STL user, you may want to check Thrust.
  • If you really want to stick to 2D notation, there is cudaMemcpy2D(), but I would discourage you from taking the more complicated route. Flattened 1D arrays will always work, especially if you have to pass them to functions.
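
Here is a minimal sketch of what the flattened layout from the first bullet could look like for your nConnType data, shown for one of the two nConns slices. The names nElems (your *size/2), h_allCondu, h_counts, h_offsets and their device counterparts are placeholders I made up, so adjust them to your code; it needs <stdlib.h>, <string.h> and the CUDA runtime.

int nElems = *size / 2;   // same element count as in the code above

// concatenate every nCondu array into one contiguous host buffer and
// remember where each element's data starts (prefix sum of the counts)
int *h_counts  = (int *) malloc(nElems * sizeof(int));
int *h_offsets = (int *) malloc(nElems * sizeof(int));
int totalDoubles = 0;
for (int j = 0; j < nElems; j++) {
    h_counts[j]  = nConns[0][j].nCount;
    h_offsets[j] = totalDoubles;
    totalDoubles += h_counts[j];
}

double *h_allCondu = (double *) malloc(totalDoubles * sizeof(double));
for (int j = 0; j < nElems; j++)
    memcpy(h_allCondu + h_offsets[j], nConns[0][j].nCondu,
           h_counts[j] * sizeof(double));

// three bulk copies replace the per-element cudaMemcpy loop
double *d_allCondu; int *d_counts, *d_offsets;
CUDA_CHECK_RETURN(cudaMalloc((void **) &d_allCondu, totalDoubles * sizeof(double)));
CUDA_CHECK_RETURN(cudaMalloc((void **) &d_counts,  nElems * sizeof(int)));
CUDA_CHECK_RETURN(cudaMalloc((void **) &d_offsets, nElems * sizeof(int)));
CUDA_CHECK_RETURN(cudaMemcpy(d_allCondu, h_allCondu, totalDoubles * sizeof(double), cudaMemcpyHostToDevice));
CUDA_CHECK_RETURN(cudaMemcpy(d_counts,  h_counts,  nElems * sizeof(int), cudaMemcpyHostToDevice));
CUDA_CHECK_RETURN(cudaMemcpy(d_offsets, h_offsets, nElems * sizeof(int), cudaMemcpyHostToDevice));

// on the device, element j's data is
// d_allCondu[d_offsets[j]] .. d_allCondu[d_offsets[j] + d_counts[j] - 1]

The per-element sizes are absorbed by the offsets array, so the whole dataset still moves in a handful of bulk copies instead of one cudaMemcpy per element.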

Thanks for the help.

I have to do the host-to-device copy element by element because the data size to transfer is different every iteration. Since cudaMallocManaged doesn't have a realloc function, I can't do this efficiently with managed memory either. Also, a priori I don't know how much space to allocate. And indexing into a flattened 1D layout is not really a 'small' formula, I guess.
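
For reference, a rough sketch of what the kernel-side lookup could look like with the flattened buffers from the sketch above (d_allCondu, d_offsets, d_counts and nElems are the placeholder names assumed there); the per-element sizes can differ, since the offsets array absorbs that:

// needs <stdio.h> for device-side printf
__global__ void compute_kernel_flat(const double * __restrict__ d_allCondu,
                                    const int    * __restrict__ d_offsets,
                                    const int    * __restrict__ d_counts,
                                    int nElems)
{
    int threadNum = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadNum >= nElems)
        return;

    // element threadNum owns d_counts[threadNum] doubles starting at
    // d_allCondu[d_offsets[threadNum]]
    for (int i = 0; i < d_counts[threadNum]; i++)
        printf("nib%d con%f \n", i, d_allCondu[d_offsets[threadNum] + i]);
}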