Hello everyone!
I have a problem copying data from device to host when running on a 64-bit architecture. The odd thing is that the exact same code works flawlessly when compiled for 32-bit. I’m afraid I’m not allowed to share as much code as I would like (I know this is really problematic, and I’m sorry), but I will outline as much as possible. The array is allocated like this:
// Allocate
isError = cudaMalloc(&d_SIMAPS, numel_SIMAPS_in_bytes);
checkError(isError, numel_SIMAPS_in_bytes, "d_SIMAPS ALLOC");
// Initialize memory to zeros
isError = cudaMemset(d_SIMAPS, 0, numel_SIMAPS_in_bytes);
checkError(isError, numel_SIMAPS_in_bytes, "d_SIMAPS MEMSET");
and filled by a kernel like this:
__global__ void cuSixel(float* d_SIMAPS, ...){
    unsigned int M = blockIdx.x;
    unsigned int N = threadIdx.x;
    extern __shared__ float s_SIMAP[];
    s_SIMAP[N] = 0;
    [...]
    // Write sixel to global memory
    d_SIMAPS[M + N * *pitch] = s_SIMAP[N];
}
I’m copying to a pointer initialized in a MATLAB MEX function:
h_SIMAPS = (float*)mxGetPr(plhs[0] = mxCreateNumericMatrix(M, N, mxSINGLE_CLASS, mxREAL));
The error occurs when I copy the results to the host after the kernel is invoked (the kernel finishes normally):
/* INVOKE KERNEL */
cuSixel <<< dimGrid , dimBlock , N >>> (d_SIMAPS, ...);
/* GET RESULTS */
// Copy to host
isError = cudaMemcpy(h_SIMAPS, d_SIMAPS, numel_SIMAPS_in_bytes, cudaMemcpyDeviceToHost);
checkError(isError, numel_SIMAPS_in_bytes, "d_SIMAPS TO HOST");
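One thing that may be worth checking at this step: kernel launches are asynchronous, so an error reported by cudaMemcpy can actually originate in the kernel that ran before it. In addition, the third launch parameter is the dynamic shared memory size in bytes, so a buffer of N floats needs N * sizeof(float). The following is only a minimal sketch of a launch followed by explicit checks. The kernel fillKernel and the helpers sharedBytesForFloats and launchAndCheck are hypothetical stand-ins and are not taken from the code above. It also assumes CUDA 4.0 or later for cudaDeviceSynchronize (on CUDA 3.x the equivalent call is cudaThreadSynchronize).

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: one block per row, one thread per column.
// Each thread stages a zero in dynamic shared memory and writes it out.
__global__ void fillKernel(float* out, unsigned int pitch) {
    unsigned int M = blockIdx.x;
    unsigned int N = threadIdx.x;
    extern __shared__ float s_buf[];
    s_buf[N] = 0.0f;
    out[M + N * pitch] = s_buf[N];
}

// Dynamic shared memory is requested in bytes, not in elements.
size_t sharedBytesForFloats(unsigned int n) {
    return static_cast<size_t>(n) * sizeof(float);
}

// Launches the kernel and surfaces errors before any later memcpy.
// d_out must hold at least rows * cols floats.
cudaError_t launchAndCheck(float* d_out, unsigned int rows, unsigned int cols) {
    fillKernel<<<rows, cols, sharedBytesForFloats(cols)>>>(d_out, rows);

    // Catches invalid launch configurations (grid, block, shared memory size).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Blocks until the kernel finishes and returns any execution error,
    // including a launch timeout.
    return cudaDeviceSynchronize();
}
```

If launchAndCheck returns something other than cudaSuccess, cudaGetErrorString(err) gives a readable message, and the problem lies in the launch or the kernel rather than in the copy itself.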
As I said, this returns a timeout error on 64-bit but not on 32-bit. Does anyone have experience with a similar issue?
Thanks
Edit: I use a 2009 Mac Pro, Snow Leopard, MATLAB R2010b 64-bit, MATLAB R2010a 32-bit, and CUDA 3.1.