cudaMemcpy timeout on 64-bit, not on 32-bit

Hello everyone!

I have a problem copying data from device to host when running on a 64-bit architecture. The odd thing is that the exact same code works flawlessly when compiled for 32-bit. I’m afraid I’m not allowed to share as much code as I would like (I know this is really problematic, and I’m sorry), but I will outline as much as possible. The array is allocated like this:

	// Allocate
	isError = cudaMalloc(&d_SIMAPS, numel_SIMAPS_in_bytes);
	checkError(isError, numel_SIMAPS_in_bytes, "d_SIMAPS ALLOC");

	// Initialize memory to zeros
	isError = cudaMemset(d_SIMAPS, 0, numel_SIMAPS_in_bytes);
	checkError(isError, numel_SIMAPS_in_bytes, "d_SIMAPS MEMSET");

and filled by a kernel using

__global__ void cuSixel(float* d_SIMAPS, ...){
	unsigned int M = blockIdx.x;
	unsigned int N = threadIdx.x;

	extern __shared__ float s_SIMAP[];
	s_SIMAP[N] = 0;

	[...]

	// Write sixel to global memory
	d_SIMAPS[M + N * *pitch] = s_SIMAP[N];
}
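One check that follows from the indexing above, as a minimal host-side sketch: the kernel writes to `d_SIMAPS[M + N * *pitch]`, so the largest index touched is `(gridDim.x - 1) + (blockDim.x - 1) * pitch`. The helper below is hypothetical (its name and parameters are not part of the original code) and assumes `pitch` is measured in elements rather than bytes. It only compares that largest index against the element count of the allocation.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the original code.
 * Returns 1 if every index M + N * pitch with M < grid_x and N < block_x
 * falls inside an array of numel elements, 0 otherwise.
 * Assumption: pitch is measured in elements, not bytes. */
int write_index_in_bounds(size_t grid_x, size_t block_x,
                          size_t pitch, size_t numel)
{
    if (grid_x == 0 || block_x == 0) {
        return 1; /* no threads, so no writes */
    }
    /* Largest index written by the last thread of the last block. */
    size_t max_index = (grid_x - 1) + (block_x - 1) * pitch;
    return max_index < numel;
}
```

With example values only: a 64-block grid of 32 threads and a pitch of 64 fits exactly into 64 * 32 elements, while the same launch with a pitch of 256 does not.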

I’m copying to a host pointer obtained in a MATLAB MEX function:

h_SIMAPS = (float*)mxGetPr(plhs[0] = mxCreateNumericMatrix(M, N, mxSINGLE_CLASS, mxREAL));
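Since the host buffer is an M-by-N single-precision matrix, it holds `M * N * sizeof(float)` bytes. A minimal sketch of a size check, assuming the number of bytes later passed to `cudaMemcpy` is `numel_SIMAPS_in_bytes`; both helper names are hypothetical and not part of the original code:

```c
#include <stddef.h>

/* Hypothetical helper: size in bytes of a rows-by-cols
 * single-precision matrix such as the one backing h_SIMAPS. */
size_t host_matrix_bytes(size_t rows, size_t cols)
{
    return rows * cols * sizeof(float);
}

/* Returns 1 if copying copy_bytes into that matrix stays inside it,
 * 0 if the copy would run past the end of the host buffer. */
int copy_fits_host_matrix(size_t copy_bytes, size_t rows, size_t cols)
{
    return copy_bytes <= host_matrix_bytes(rows, cols);
}
```

In the MEX function this could be called as `copy_fits_host_matrix(numel_SIMAPS_in_bytes, M, N)` before the copy. If the device array is padded to a pitch larger than M, the two sizes may differ.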

The error occurs when I copy the results to the host after the kernel is invoked (the kernel finishes normally):

/* INVOKE KERNEL */
	cuSixel <<< dimGrid , dimBlock , N >>> (d_SIMAPS, ...);

/* GET RESULTS */
	// Copy to host
	isError = cudaMemcpy(h_SIMAPS, d_SIMAPS, numel_SIMAPS_in_bytes, cudaMemcpyDeviceToHost);
	checkError(isError, numel_SIMAPS_in_bytes, "d_SIMAPS TO HOST");
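Two details of the launch above may be worth checking. The third launch parameter is the dynamic shared memory size in bytes, and the kernel indexes `s_SIMAP` as a `float` array with one element per thread. A minimal sketch of the size computation, assuming the host-side `N` is the number of threads per block (the helper name is hypothetical):

```c
#include <stddef.h>

/* Hypothetical helper: bytes of dynamic shared memory needed for an
 * `extern __shared__ float` array with one element per thread.
 * Assumption: threads_per_block equals dimBlock.x in the launch. */
size_t shared_bytes_for_floats(size_t threads_per_block)
{
    return threads_per_block * sizeof(float);
}

/* Possible use in the launch (CUDA C, shown as a comment only):
 *   cuSixel <<< dimGrid, dimBlock, shared_bytes_for_floats(N) >>> (d_SIMAPS, ...);
 */
```

Kernel launches are also asynchronous, so an error raised inside the kernel can surface at the next synchronizing call, such as `cudaMemcpy`. Calling `cudaThreadSynchronize()` and `cudaGetLastError()` right after the launch would show whether the kernel itself is failing.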

As I said, this returns a timeout error on 64-bit but not on 32-bit. Does anyone have experience with a similar issue?

Thanks

Edit: I use a 2009 Mac Pro, Snow Leopard, MATLAB R2010b 64-bit, MATLAB R2010a 32-bit, and CUDA 3.1.
