Vector Reduction cudaMemcpy() (error code invalid argument)!

For the past few weeks we have been writing an LJ code that calculates the Lennard-Jones energy of a series of unique atom interactions. Our first kernel calculates the LJ energy of each pair and puts its value in a vector. Previously we just copied this large array back to the CPU and summed it there, which was rather inefficient. I have tried to add another kernel that takes this large vector and reduces it on the GPU. The code compiles fine, but when I test the program with a small system I get the following error.

Failed to copy vector d_g_odata from device to host (error code invalid argument)!

Here is the reduction kernel. I am assuming the previous kernel worked and that copying the vdwe array was also successful; it was working before.

__global__ void
reduce1(double *g_idata, double *g_odata, int N)
{
    extern __shared__ double sdata[];

    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

    sdata[tid] = (i < N) ? g_idata[i] : 0;

    __syncthreads();

    // do reduction in shared mem
    for (unsigned int s=1; s < blockDim.x; s *= 2)
    {
        int index = 2 * s * tid;

        if (index < blockDim.x)
        {
            sdata[index] += sdata[index + s];
        }

        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

Here are the commands at the kernel launch, where N is the number of unique atom interactions.

dim3 threadsPerBlock (imin(numatoms,32), imin(numatoms,32));
dim3 blocksPerGrid ((numatoms+32-1)/32, (numatoms+32-1)/32);

printf("The number of threadblocks is %d*%d \n", (numatoms+32-1)/32, (numatoms+32-1)/32);
clock_t t;
t = clock();
lje<<<blocksPerGrid, threadsPerBlock>>>( d_atype, d_atomx, d_atomy, d_atomz, d_vdwe, numatoms );

err = cudaMemcpy(vdwe, d_vdwe, N*sizeof(double), cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
{
    fprintf(stderr, "Failed to copy vector vdwe from device to host (error code %s)!\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

err = cudaMemcpy(d_g_idata, vdwe, N*sizeof(double), cudaMemcpyHostToDevice);
if (err != cudaSuccess)
{
    fprintf(stderr, "Failed to copy vector d_g_idata from host to device (error code %s)!\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

//dim3 tPB (imin(N,32),imin(N,32));
//dim3 bPG ((N+32-1)/32,(N+32-1)/32);

int tPB = 1024;
dim3 bPG = (N+tPB-1)/(tPB);

reduce1<<<tPB, bPG>>>( d_g_idata, d_g_odata, N );

t_vdwe = (double*) calloc(((N+32-1)/32*(N+32-1)/32), sizeof(double));

err = cudaMemcpy(d_g_odata, t_vdwe, ((N+32-1)/32*(N+32-1)/32)*sizeof(double), cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
{
    fprintf(stderr, "Failed to copy vector d_g_odata from device to host (error code %s)!\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

t = clock() - t;

If anyone wants to dig through my entire code I can post it, but it is around 400 lines.

It looks like your kernel is using dynamically allocated shared memory, but your execution configuration doesn't specify the number of bytes of shared memory to allocate per block.

I would expect your reduce1 kernel's execution configuration to look something like this:

reduce1<<<bPG, tPB, sPB>>>

where 'sPB' is the number of shared-memory bytes per block. Note also that the grid dimension comes first in the execution configuration and the block dimension second; your launch passes them as <<<tPB, bPG>>>, i.e. swapped.
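Concretely, a corrected launch might look like the sketch below (tPB, bPG, and the double element type are taken from the question; sPB is a name I'm introducing for the dynamic shared-memory size):

// one double of shared memory per thread, for the extern __shared__ sdata[]
int tPB = 1024;                      // threads per block
int bPG = (N + tPB - 1) / tPB;       // blocks per grid
size_t sPB = tPB * sizeof(double);   // dynamic shared bytes per block

// grid dimension first, block dimension second, then shared bytes
reduce1<<<bPG, tPB, sPB>>>(d_g_idata, d_g_odata, N);
err = cudaGetLastError();            // catch launch-configuration errors immediately

The destination host buffer for the partial sums then needs bPG elements (one per block), not the (N+32-1)/32 * (N+32-1)/32 count left over from the 2D configuration.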