Hi!
I am using CUDA to speed up the calculation of a 1-D convolution for two arrays. The device I am using is a Jetson Nano.
My program gets two arrays from shared memory, and their sizes from the command line. It then uses CUDA to calculate the convolution between A and B.
This is my code.
Note: I cut out the part where the arrays are copied from shared memory. That part works fine, but it is big and messy.
#define USECPSEC 1000000ULL
#define nTPB 256
#define mytype double
/**
 * Full 1-D convolution: out[k] = sum over i of A[i] * B[k - i], for k = 0 .. N-2.
 *
 * Preconditions (same contract the original body relied on):
 *  - A and B must each be readable up to index N-2: for k = N-2 the sum
 *    touches A[0..N-2] and B[0..N-2].
 *    NOTE(review): main() passes N = sizeA + sizeB while allocating only
 *    sizeA elements for A and sizeB for B, so for unequal/large arrays these
 *    reads go out of bounds. Fixing that requires passing the two sizes as
 *    separate parameters — an interface change, flagged rather than made here.
 *  - out must hold at least N-1 elements.
 *
 * Grid-stride loop: produces all N-1 outputs for ANY launch configuration.
 * The original one-thread-per-output form silently skipped out[N-2] whenever
 * the caller launched fewer than N-1 threads — which main()'s
 * ceil((N-2)/nTPB) grid does exactly when N-2 is a multiple of nTPB.
 */
__global__ void conv_Kernel2(const mytype * __restrict__ A, const mytype * __restrict__ B, mytype *out, const int N){
  const int stride = blockDim.x * gridDim.x;
  for (int idx = threadIdx.x + blockDim.x * blockIdx.x; idx < N - 1; idx += stride){
    mytype my_sum = 0;
    // Under the guard idx < N-1, the original's (idx >= N) branch was
    // unreachable, so its condition reduces to i <= idx.
    for (int i = 0; i <= idx; i++)
      my_sum += A[i] * B[idx - i];
    out[idx] = my_sum;
  }
}
// Positions of the expected command-line arguments inside argv;
// ARG_NUM doubles as the minimum expected argc.
enum ARGS{ARG_NAME, ARG_VEC1_SIZE, ARG_VEC2_SIZE, ARG_NUM};
/**
* this program gets via shared memory two arrays of double, called A and B, and gets as arguments the sizes of A and B.
*/
int main(int argc, char *argv[]){
double *h_A, *d_A, *h_result, *d_result, *h_B, *d_B;
cout<<"initializing ..."<<endl;
//----------------------------------------------------
//get A and B (2 arrays of double) from shared memory copy them into h_A and h_B
//----------------------------------------------------
//Take the size of the two vectors from command line...
if (argc < ARG_NUM)
    throw runtime_error("Error - expected two arguments: sizeA sizeB");
const int arg_vecA_size = atoi(argv[ARG_VEC1_SIZE]); //size of A
const int arg_vecB_size = atoi(argv[ARG_VEC2_SIZE]); //size of B
if (arg_vecA_size <= 0 || arg_vecB_size <= 0)
    throw runtime_error("Error - vector sizes must be positive integers");
// All byte counts in size_t so large sizes cannot overflow int arithmetic.
// NOTE(review): the full convolution has sizeA+sizeB-1 elements; this program
// allocates one extra slot (sizeA+sizeB) and passes that sum as N, matching
// the kernel's idx < N-1 guard.
const size_t n_out     = (size_t)arg_vecA_size + (size_t)arg_vecB_size;
const size_t out_bytes = n_out * sizeof(mytype);
cout<<"Allocating memory ..."<<endl;
//allocation of memory
h_result = (double *)malloc(out_bytes);
if (h_result == NULL)
    throw runtime_error("Error - malloc of h_result");
//Allocation of cuda memory
if(cudaMalloc(&d_B, arg_vecB_size * sizeof(mytype)) != cudaSuccess){
throw runtime_error("Error - cudaMalloc of d_B");
};
if(cudaMalloc(&d_A, arg_vecA_size * sizeof(mytype))!=cudaSuccess){
throw runtime_error("Error - cudaMalloc of d_A");
};
if(cudaMalloc(&d_result, out_bytes)!=cudaSuccess){
throw runtime_error("Error - cudaMalloc of d_result");
};
for (size_t i=0; i < n_out; i++){
h_result[i] = 0;
}
cout<<"Copying memory on device..."<<endl;
//copy memory on device
if(cudaMemset(d_result, 0, out_bytes)!=cudaSuccess){
throw runtime_error("Error on cudaMemset of d_result"); // was mislabeled "cudaMemcpy"
};
if(cudaMemcpy(d_A, h_A, arg_vecA_size * sizeof(mytype), cudaMemcpyHostToDevice)!=cudaSuccess){
throw runtime_error(" Error on cudaMemcpy of d_A");
};
if(cudaMemcpy(d_B, h_B, arg_vecB_size * sizeof(mytype), cudaMemcpyHostToDevice)!=cudaSuccess){
throw runtime_error("Error on cudaMemcpy of d_B");
};
cout<<"Launching Kernel..."<<endl;
// The convolution has n_out-1 valid outputs (indices 0..n_out-2), so n_out-1
// threads are needed. The original ceil-division used n_out-2, which drops
// the last output element whenever n_out-2 is an exact multiple of nTPB.
const int n_threads_needed = (int)(n_out - 1);
conv_Kernel2<<<(n_threads_needed + nTPB - 1)/nTPB, nTPB>>>(d_A, d_B, d_result, (int)n_out);
// A kernel launch is asynchronous and carries no status of its own:
// configuration errors surface via cudaGetLastError(), execution errors
// surface at the next synchronizing call. On a Jetson Nano with a display
// attached, a long-running kernel is killed by the watchdog and that shows
// up here (or in a later cudaMemcpy) as cudaErrorLaunchTimeout (702) — the
// copy itself is not the slow part.
if(cudaGetLastError() != cudaSuccess)
    throw runtime_error("Error on kernel launch of conv_Kernel2");
cudaError_t error = cudaDeviceSynchronize();
if( error != cudaSuccess){
cerr << "error number is "<< error << " (" << cudaGetErrorString(error) << ")" << endl;
throw runtime_error("Error during kernel execution of conv_Kernel2");
};
error = cudaMemcpy(h_result, d_result, out_bytes, cudaMemcpyDeviceToHost);
if( error != cudaSuccess){
cerr << "error number is "<< error << " (" << cudaGetErrorString(error) << ")" << endl;
throw runtime_error("Error on cudaMemcpy of d_result back to host"); // message used to say "cudaMalloc"
};
// Release device and host allocations before exiting.
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_result);
free(h_result);
return 0;
}
This code works correctly with small A and B.
The problem arises when A and B are big (e.g. 705,830 elements in A and 794,029 in B; note the code
allocates h_result with arg_vecA_size + arg_vecB_size = 1,499,859 elements — not 705830 × 794029).
In that case I get the exception `runtime_error("Error on cudaMalloc of d_result back to host")`,
which corresponds to the `cudaMemcpy` from device to host.
The value returned by `cudaMemcpy` is 702, which corresponds to `cudaErrorLaunchTimeout`.
At first this made me think the process of copying the full array back to RAM is too slow to complete
in time — but note that `cudaMemcpy` also reports errors left over from the preceding asynchronous
kernel launch, so the timeout may actually come from the kernel itself exceeding the display
watchdog limit rather than from the copy.
Does anyone have an idea of what could be causing it? And how to resolve it? Thanks