Vector Reduction cudaMemcpy() (error code invalid argument)!

For the past few weeks we have been writing an LJ code that calculates the Lennard-Jones energy of a series of unique atom interactions. Our first kernel calculates the LJ energy of each pair and puts its value in a vector. Previously we just copied this large array back to the CPU and summed it there, which was rather inefficient. I have tried to add another kernel that takes this large vector and reduces it on the GPU. The code compiles fine, but when I test the program with a small system I get the following error.

Failed to copy vector d_g_odata from device to host (error code invalid argument)!

Here is the reduction kernel. I am assuming the previous kernel worked and that copying the vdwe array was also successful; it was working before.

__global__ void
reduce1(double *g_idata, double *g_odata, int N)
{
    extern __shared__ double sdata[];

    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[tid] = (i < N) ? g_idata[i] : 0;

    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s=1; s < blockDim.x; s *= 2)
    {
        int index = 2 * s * tid;

        if (index < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }

        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

Here are the commands at the kernel launch, where N is the number of unique atom interactions.

dim3 threadsPerBlock (imin(numatoms,32), imin(numatoms,32));
dim3 blocksPerGrid ((numatoms+32-1)/32, (numatoms+32-1)/32);

printf("The number of threadblocks is %d*%d \n", (numatoms+32-1)/32, (numatoms+32-1)/32);
clock_t t;
t = clock();
lje<<<blocksPerGrid, threadsPerBlock>>>( d_atype, d_atomx, d_atomy, d_atomz, d_vdwe, numatoms );

err = cudaMemcpy(vdwe, d_vdwe, N*sizeof(double), cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
{
    fprintf(stderr, "Failed to copy vector vdwe from device to host (error code %s)!\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

err = cudaMemcpy(d_g_idata, vdwe, N*sizeof(double), cudaMemcpyHostToDevice);
if (err != cudaSuccess)
{
    fprintf(stderr, "Failed to copy vector d_g_idata from host to device (error code %s)!\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

//dim3 tPB (imin(N,32),imin(N,32));
//dim3 bPG ((N+32-1)/32,(N+32-1)/32);

int tPB = 1024;
dim3 bPG = (N+tPB-1)/(tPB);

reduce1<<<tPB, bPG>>>( d_g_idata, d_g_odata, N );

t_vdwe = (double*) calloc(((N+32-1)/32*(N+32-1)/32), sizeof(double));

err = cudaMemcpy(d_g_odata, t_vdwe, ((N+32-1)/32*(N+32-1)/32)*sizeof(double), cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
{
    fprintf(stderr, "Failed to copy vector d_g_odata from device to host (error code %s)!\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

t = clock() - t;

If anyone wants to dig through my entire code I can post it, but it is around 400 lines.

It looks like your kernel is using dynamically allocated shared memory, but your execution configuration doesn't specify the number of bytes of shared memory to allocate per block.

I would expect your reduce1 kernel's execution configuration to look something like this:

reduce1<<<bPG, tPB, sPB>>>

where 'sPB' is the number of shared-memory bytes per block. Note also that the grid dimension comes first in the execution configuration and the block dimension second; your launch passes them as <<<tPB, bPG>>>, i.e. swapped.
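Concretely, a corrected launch might look like the sketch below (tPB, bPG, and the double element type are taken from the question; sPB is a name I'm introducing for the dynamic shared-memory size):

// one double of shared memory per thread, for the extern __shared__ sdata[]
int tPB = 1024;                      // threads per block
int bPG = (N + tPB - 1) / tPB;       // blocks per grid
size_t sPB = tPB * sizeof(double);   // dynamic shared bytes per block

// grid dimension first, block dimension second, then shared bytes
reduce1<<<bPG, tPB, sPB>>>(d_g_idata, d_g_odata, N);
err = cudaGetLastError();            // catch launch-configuration errors immediately

The destination host buffer for the partial sums then needs bPG elements (one per block), not the (N+32-1)/32 * (N+32-1)/32 count left over from the 2D configuration.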