(Solved)
Hi All:
I am a newcomer and have a question about overlapping transfers with computation. I tried to follow the method in:
Overlapping GPU<->CPU copies with GPU and CPU computation by using buffers and cudaMemcpyAsync, in order to improve performance.
However, I ran into a runtime error: “memory size of pointer value too large to fit in 32 bit in file *** in line ***”
Here is the main part of the code:
cudaStream_t uploadStream, downloadStream, computeStream;
int bufNum = 0;
int *pCPUbuf[3];
int *pGPUbuf[3];
for (int i = 0; i < 20; i++) {
    HANDLE_ERROR(cudaMemcpyAsync(pGPUbuf[bufNum%3], pCPUbuf[(bufNum+1)%3], N * sizeof(int), cudaMemcpyHostToDevice, uploadStream));
    HANDLE_ERROR(cudaMemcpyAsync(pGPUbuf[(bufNum+2)%3], pCPUbuf[(bufNum+2)%3], N * sizeof(int), cudaMemcpyDeviceToHost, downloadStream)); //error!!
    kernel<<<N/256, 256, 0, computeStream>>>(pGPUbuf[(bufNum+1)%3]);
    //CPU computation stuff
    for (int i = 0; i < N; i++) {
        *(pCPUbuf[bufNum]+i) = rand()%10000;
    }
    cudaThreadSynchronize();
    bufNum++;
    bufNum %= 3;
}
HANDLE_ERROR is a macro that tests whether a CUDA function call succeeded. If the call failed, HANDLE_ERROR prints the error message from cudaGetErrorString.
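For reference, a macro of the kind described might look like the sketch below. The actual definition is not shown in this post, so the helper name HandleError, the message format, and the decision to exit on failure are assumptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: prints the CUDA error string together with the
// source location and aborts if the call did not return cudaSuccess.
static void HandleError(cudaError_t err, const char *file, int line) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s in file %s in line %d\n",
                cudaGetErrorString(err), file, line);
        exit(EXIT_FAILURE);
    }
}

// Wraps any expression that evaluates to a cudaError_t.
#define HANDLE_ERROR(call) (HandleError((call), __FILE__, __LINE__))
```

Wrapping each runtime call this way reports the first failing call and its line, which is how the line marked //error!! was identified.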
Has anyone else run into this problem?
Is there anything wrong with my implementation?
Thanks.
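A note for readers who find this thread: in the loop above, the call marked //error!! requests cudaMemcpyDeviceToHost but passes the device pointer as the destination and the host pointer as the source. cudaMemcpyAsync takes its arguments as (dst, src, count, kind, stream), so a device-to-host copy needs the host pointer first. The excerpt also does not show the streams being created or the buffers being allocated. The sketch below is one possible corrected arrangement, not necessarily the fix that resolved the original problem. The value of N, the kernel body, the names run_pipeline and CHECK, and the buffer rotation are all assumptions. It allocates pinned host memory with cudaMallocHost, which asynchronous copies need in order to overlap, and uses cudaDeviceSynchronize in place of the deprecated cudaThreadSynchronize.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

#define N (1 << 20)  // assumed buffer length; a multiple of 256

// Return the first failing CUDA status to the caller.
// Cleanup on error paths is omitted to keep the sketch short.
#define CHECK(call) \
    do { cudaError_t e_ = (call); if (e_ != cudaSuccess) return e_; } while (0)

// Hypothetical stand-in for the kernel in the post.
__global__ void kernel(int *data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) data[idx] += 1;
}

cudaError_t run_pipeline(int iterations) {
    const size_t bytes = N * sizeof(int);
    cudaStream_t uploadStream, downloadStream, computeStream;
    int *pCPUbuf[3], *pGPUbuf[3];

    CHECK(cudaStreamCreate(&uploadStream));
    CHECK(cudaStreamCreate(&downloadStream));
    CHECK(cudaStreamCreate(&computeStream));
    for (int k = 0; k < 3; k++) {
        CHECK(cudaMallocHost((void **)&pCPUbuf[k], bytes));  // pinned host memory
        CHECK(cudaMalloc((void **)&pGPUbuf[k], bytes));
        memset(pCPUbuf[k], 0, bytes);
        CHECK(cudaMemset(pGPUbuf[k], 0, bytes));
    }
    CHECK(cudaDeviceSynchronize());

    for (int iter = 0; iter < iterations; iter++) {
        int cur  = iter % 3;        // CPU fills host[cur]; upload targets device[cur]
        int prev = (iter + 2) % 3;  // host[prev] was filled in the previous iteration
        int old  = (iter + 1) % 3;  // device[old] was computed in the previous iteration

        // Host -> device: destination is the device pointer.
        CHECK(cudaMemcpyAsync(pGPUbuf[cur], pCPUbuf[prev], bytes,
                              cudaMemcpyHostToDevice, uploadStream));
        // Device -> host: destination is the host pointer.
        CHECK(cudaMemcpyAsync(pCPUbuf[old], pGPUbuf[old], bytes,
                              cudaMemcpyDeviceToHost, downloadStream));
        // Compute on the buffer uploaded in the previous iteration.
        kernel<<<N / 256, 256, 0, computeStream>>>(pGPUbuf[prev]);
        CHECK(cudaGetLastError());

        // CPU work overlaps with the three GPU operations above.
        for (int j = 0; j < N; j++) {
            pCPUbuf[cur][j] = rand() % 10000;
        }

        // Wait for all streams before rotating the buffers.
        CHECK(cudaDeviceSynchronize());
        // Results in pCPUbuf[old] should be consumed here; the CPU
        // overwrites that buffer in the next iteration.
    }

    for (int k = 0; k < 3; k++) {
        CHECK(cudaFreeHost(pCPUbuf[k]));
        CHECK(cudaFree(pGPUbuf[k]));
    }
    CHECK(cudaStreamDestroy(uploadStream));
    CHECK(cudaStreamDestroy(downloadStream));
    CHECK(cudaStreamDestroy(computeStream));
    return cudaSuccess;
}
```

Calling run_pipeline(20) from main mirrors the 20 iterations in the post. In each iteration the three GPU operations and the CPU loop touch different buffers, so no buffer is read and written at the same time. The first two iterations only fill the pipeline, so their downloaded contents are not meaningful results.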