Why isn't my device data being transferred to the host?

Hello,

I am having a lot of trouble copying an array from the device back to the host and reading it. When I read the data that should have been returned, all I get is junk. Could anyone take a look at my code snippets and tell me what I’m doing wrong? Thank you very much!

struct intss {
u_int32_t one;
u_int32_t two;
};

int main()
{
int block_size = 3;
int grid_size = 1;

    intss *device_fb = 0;
    intss *host_fb = 0;


    int num_bytes_fb = (block_size*grid_size)*sizeof(intss);
   

host_fb = (intss*)malloc(num_bytes_fb); 
cudaMalloc((void **)&device_fb, num_bytes_fb);

    ....

    render2<<<block_size,grid_size>>>(device_fb, device_pixelspercore, samples, obj_list_flat_dev, numOpsPerCore, lnumdev, camdev, lightsdev, uranddev, iranddev);


    ....

   cudaMemcpy(host_fb, device_fb, num_bytes_fb, cudaMemcpyDeviceToHost);


   printf("output %d ", host_fb[0].one);

   printf("output %d ", host_fb[1].one);

   printf("output %d ", host_fb[2].one);   
   //Note that I'm only looking at the 3 elements 0-2 of host_fb. I am doing this because block_size*grid_size = 3. Is this wrong?

    cudaFree(device_fb);
    free(host_fb);

}

__global__ void render2(intss *device_fb, struct parallelPixels *pixelsPerCore, int samples, double *obj_list_flat_dev, int numOpsPerCore, int lnumdev, struct camera camdev, struct vec3 *lightsdev, struct vec3 *uranddev, int *iranddev) //SPECIFY ARGUMENTS!!!
{
int index = blockIdx.x * blockDim.x + threadIdx.x; //DETERMINING INDEX BASED ON WHICH THREAD IS CURRENTLY RUNNING

....

//computing data...


device_fb[index].one = (((u_int32_t)(MIN(r, 1.0) * 255.0) & 0xff) << RSHIFT |   
                  ((u_int32_t)(MIN(g, 1.0) * 255.0) & 0xff) << GSHIFT |
                  ((u_int32_t)(MIN(b, 1.0) * 255.0) & 0xff) << BSHIFT);

}

Check the return codes of all CUDA calls. The error is probably not in the cudaMemcpy() itself but in the kernel preceding it. Because kernel launches are asynchronous, a kernel error is only reported by a later CUDA call.
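A minimal sketch of what checking every call could look like. The CUDA_CHECK macro, the fill_index kernel, and run_checked are hypothetical names made up for this illustration and are not from the posted code; only the CUDA runtime calls themselves are real. The point is the pair of checks after the launch: cudaGetLastError() catches a bad launch, and cudaDeviceSynchronize() surfaces errors raised while the kernel was running, so they do not show up later on an unrelated cudaMemcpy().

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the original post):
// print file/line and the CUDA error string, then abort.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Stand-in kernel for illustration: each thread writes its global index.
__global__ void fill_index(unsigned int *out, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) out[index] = index;
}

// Launches the kernel with full error checking and copies n values
// back into host_out. Returns 0 on success (aborts on any CUDA error).
int run_checked(unsigned int *host_out, int n)
{
    unsigned int *dev_out = NULL;
    CUDA_CHECK(cudaMalloc((void **)&dev_out, n * sizeof(unsigned int)));

    fill_index<<<1, n>>>(dev_out, n);     // <<<number of blocks, threads per block>>>
    CUDA_CHECK(cudaGetLastError());       // reports an invalid launch
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(host_out, dev_out, n * sizeof(unsigned int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev_out));
    return 0;
}
```

Calling run_checked(buffer, 3) from main() with a 3-element host array should leave each element equal to its own index. If any step fails, the message points at the exact line instead of at a later call.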

I recently found out that if I set device_fb[index] = index; in the kernel function, I am able to get the correct results. However, after calling all of my device functions to compute the output to go into device_fb[index], I am still getting junk data when copying back to the host.

I have now checked the return codes of all CUDA calls, and I am getting “invalid argument” on a few of my cudaMemcpy calls. I am not sure which argument the error refers to. I did not get an error when allocating the device arrays that receive data from the host, so I wonder whether it is safe to assume that the device arrays are not what causes the “invalid argument” error. I have pasted my code to the following pastebin link: http://pastebin.com/UgzABPgH

The lines which produce errors when tested are 416, 431, 443, 452, 468.

Thank you very much!

Two common reasons for “invalid argument” on cudaMemcpy() calls are:

[1] Passing a device pointer where a host pointer is expected, or vice versa.
[2] Requesting a transfer size that exceeds the allocation size of the device memory object pointed to by a device pointer.
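Both causes can be tested for before the copy is issued. The following is only a sketch under stated assumptions: DeviceBuffer, is_device_pointer, and copy_to_device are hypothetical helpers invented for this illustration, and the code assumes a CUDA 11 or newer runtime, where cudaPointerGetAttributes() succeeds for ordinary host memory and reports it as unregistered. The runtime does not hand back the size of an allocation here, so the sketch simply stores the size next to the pointer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical wrapper (not from the original post): a device pointer
// together with the number of bytes that were allocated for it.
struct DeviceBuffer {
    void  *ptr;
    size_t bytes;
};

// Returns true if the runtime reports ptr as a device allocation.
// Assumption: CUDA 11+; older runtimes return an error for plain host
// pointers, which is treated here as "not a device pointer".
bool is_device_pointer(const void *ptr)
{
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        cudaGetLastError();  // reset the error state before returning
        return false;
    }
    return attr.type == cudaMemoryTypeDevice;
}

// Host-to-device copy that rejects the two common mistakes up front.
cudaError_t copy_to_device(DeviceBuffer dst, const void *host_src, size_t bytes)
{
    // Cause [2]: transfer larger than the device allocation.
    if (bytes > dst.bytes) return cudaErrorInvalidValue;

    // Cause [1]: host and device pointers mixed up.
    if (!is_device_pointer(dst.ptr) || is_device_pointer(host_src))
        return cudaErrorInvalidValue;

    return cudaMemcpy(dst.ptr, host_src, bytes, cudaMemcpyHostToDevice);
}
```

A pre-check like this is mainly a debugging aid: it tells you which of the two conditions failed, whereas cudaMemcpy() reports both as the same “invalid argument”.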

Thanks for the response! I’ve verified that the transfer size is equal to the allocation size of the memory object pointed to by the device pointer. I’m also pretty sure that I’m not reversing the positions of where the device pointer is and where the host pointer is. I tried reversing these values on a call to

cudaMemcpy(&obj_list_flat_dev, &obj_list_flat, (sizeof(double) * objCounter * 9), cudaMemcpyHostToDevice); 

just in case, but I still got the “invalid argument” error.

Do you need the & operators in there? cudaMemcpy takes pointers, not pointers to pointers, so if obj_list_flat_dev is a pointer, then you don’t want to take the address of a pointer.
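To illustrate the difference, here is a small sketch. It assumes obj_list_flat is a plain pointer to doubles on the host, which may not match the pastebin code exactly; round_trip and its parameters are hypothetical names used only for this example. cudaMalloc() needs the address of the pointer variable because it writes the device address into it, while cudaMemcpy() needs the addresses of the data, which are the pointer values themselves.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Copies n doubles host -> device -> host and returns the CUDA status.
// Hypothetical helper for illustration, not taken from the original code.
cudaError_t round_trip(const double *obj_list_flat, double *result, size_t n)
{
    double *obj_list_flat_dev = NULL;
    size_t bytes = n * sizeof(double);

    // cudaMalloc stores the device address INTO the pointer variable,
    // so here the address of the pointer (&) is correct.
    cudaError_t err = cudaMalloc((void **)&obj_list_flat_dev, bytes);
    if (err != cudaSuccess) return err;

    // cudaMemcpy copies the data the pointers refer to, so the pointers
    // are passed as they are. Writing &obj_list_flat_dev here would pass
    // the host address of the pointer variable, not the device buffer.
    err = cudaMemcpy(obj_list_flat_dev, obj_list_flat, bytes,
                     cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(result, obj_list_flat_dev, bytes,
                         cudaMemcpyDeviceToHost);

    cudaFree(obj_list_flat_dev);
    return err;
}
```

If round_trip() returns cudaSuccess, result holds the same values as obj_list_flat, which is a quick way to confirm that the pointer arguments are right before adding the kernel back in.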