For the past few weeks we have been writing a LJ code that calculates the Lennard Jones energy of a series of unique atom interactions. Our first kernel calculates the lj energy of each pair and puts it’s value in a vector. Previously we just copied this large array back to the cpu and sum it there which was rather inefficient. I have tried to add another kernel that takes this large vector and reduces in on the gpu. The code compiles fine however when I try to test the program with a small system I get the error.
Failed to copy vector d_g_odata from device to host (error code invalid argument)!
Here is the reduction kernel. I am assuming the previous kernel worked and copying the vdwe array was also sucessful. It was working before.
__global__ void
reduce1(double *g_idata, double *g_odata, int N)
{
extern __shared__ double sdata[];
// load shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = (i < N) ? g_idata[i] : 0;
__syncthreads();
// do reduction in shared mem
for (unsigned int s=1; s < blockDim.x; s *= 2)
{
int index = 2 * s * tid;
if (index < blockDim.x)
{
sdata[index] += sdata[index + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
Here are the commmands at the kernel launch. Where N is the number of unique atom interactions.
dim3 threadsPerBlock (imin(numatoms,32),imin(numatoms,32));
dim3 blocksPerGrid ((numatoms+32-1)/32,(numatoms+32-1)/32);
printf("The number of threadblocks is %d*%d \n", (numatoms+32-1)/32,(numatoms+32-1)/32);
clock_t t;
t = clock();
lje<<<blocksPerGrid, threadsPerBlock>>>( d_atype, d_atomx, d_atomy, d_atomz, d_vdwe, numatoms );
err = cudaMemcpy(vdwe, d_vdwe, N*sizeof(double), cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy vector vdwe from device to host (error code %s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaMemcpy( d_g_idata, vdwe, N*sizeof(double), cudaMemcpyHostToDevice);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy vector d_g_idata from host to device (error code %s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
//dim3 tPB (imin(N,32),imin(N,32));
//dim3 bPG ((N+32-1)/32,(N+32-1)/32);
int tPB = 1024;
dim3 bPG = (N+tPB-1)/(tPB);
reduce1<<<tPB, bPG>>>( d_g_idata, d_g_odata, N );
t_vdwe = (double*) calloc(((N+32-1)/32*(N+32-1)/32),sizeof(double));
err = cudaMemcpy(d_g_odata, t_vdwe, ((N+32-1)/32*(N+32-1)/32)*sizeof(double), cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy vector d_g_odata from device to host (error code %s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
t = clock() - t;
If anyone wants to dig through my entire code I can post it but is it around 400 lines.