CUDA 4-way Overlap Problem

(Solved)

Hi All:

I am a newcomer and have a question about overlapping transfers and computation. I tried to follow the method described in:

nvidia overlap

The idea is to overlap CPU<->GPU copies, GPU computation, and CPU computation by using multiple buffers and cudaMemcpyAsync, in order to improve performance.

However, I ran into a runtime error: “memory size of pointer value too large to fit in 32 bit in file *** in line ***”

Here is the main part of the code:

cudaStream_t uploadStream, downloadStream, computeStream;
int bufNum = 0;
int *pCPUbuf[3];
int *pGPUbuf[3];

for (int i=0; i<20; i++) {
    HANDLE_ERROR(cudaMemcpyAsync( pGPUbuf[bufNum%3], pCPUbuf[(bufNum+1)%3], N * sizeof(int), cudaMemcpyHostToDevice, uploadStream ));
    HANDLE_ERROR(cudaMemcpyAsync( pGPUbuf[(bufNum+2)%3], pCPUbuf[(bufNum+2)%3], N * sizeof(int), cudaMemcpyDeviceToHost, downloadStream ));  //error!!
    kernel<<<N/256,256,0,computeStream>>>( pGPUbuf[(bufNum+1)%3] );

    //CPU computation stuff
    for (int i=0; i<N; i++) {
        *(pCPUbuf[bufNum]+i) = rand()%10000;
    }

    cudaThreadSynchronize();
    bufNum++;
    bufNum %= 3;
}

HANDLE_ERROR is a macro that tests whether a CUDA function call succeeded. If the call failed, HANDLE_ERROR prints the error message from cudaGetErrorString.

Has anyone else run into this problem?

Is there anything wrong with my implementation?

Thanks.