Memcpy3D fails in release: code runs in debug, not release

Just in case anyone has any ideas: my code runs fine in both emulator modes and in debug. However, when I run it in release mode, the cudaMemcpy3D calls that use streams fail (they work in the version without streams)! I could have sworn that they were working last week, and I wonder whether a Windows update is causing the problem. I went back to my earlier code, which I thought was sorted, to check, because other pieces of code I am working on had started failing in strange ways.

So I guess I have two questions.

  1. Has anyone started having problems with the CUDA memory transfers/allocations etc as a result of a system update?

  2. Can anyone spot what is wrong with the following piece of code?

    params->srcPtr.ptr = spatial_derivatives[0] + j*BLOCK_X + k*BLOCK_Y*NX;
    params->dstPtr.ptr = d_Ixy;
    CUDA_SAFE_CALL( cudaMemcpy3DAsync(params, *stream) );

    params->srcPtr.ptr = spatial_derivatives[1] + j*BLOCK_X + k*BLOCK_Y*NX;
    params->dstPtr.ptr = d_Ixz;
    CUDA_SAFE_CALL( cudaMemcpy3DAsync(params, *stream) );

    params->srcPtr.ptr = spatial_derivatives[2] + j*BLOCK_X + k*BLOCK_Y*NX;
    params->dstPtr.ptr = d_Iyz;
    CUDA_SAFE_CALL( cudaMemcpy3DAsync(params, *stream) );

    params->srcPtr.ptr = spatial_derivatives[3] + j*BLOCK_X + k*BLOCK_Y*NX;
    params->dstPtr.ptr = d_Ixt;
    CUDA_SAFE_CALL( cudaMemcpy3DAsync(params, *stream) );

    params->srcPtr.ptr = spatial_derivatives[4] + j*BLOCK_X + k*BLOCK_Y*NX;
    params->dstPtr.ptr = d_Iyt;
    CUDA_SAFE_CALL( cudaMemcpy3DAsync(params, *stream) );

    params->srcPtr.ptr = spatial_derivatives[5] + j*BLOCK_X + k*BLOCK_Y*NX;
    params->dstPtr.ptr = d_Izt;
    CUDA_SAFE_CALL( cudaMemcpy3DAsync(params, *stream) );

    params->srcPtr.ptr = spatial_derivatives[6] + j*BLOCK_X + k*BLOCK_Y*NX;
    params->dstPtr.ptr = d_Ix2;
    CUDA_SAFE_CALL( cudaMemcpy3DAsync(params, *stream) );

    params->srcPtr.ptr = spatial_derivatives[7] + j*BLOCK_X + k*BLOCK_Y*NX;
    params->dstPtr.ptr = d_Iy2;
    CUDA_SAFE_CALL( cudaMemcpy3DAsync(params, *stream) );

    params->srcPtr.ptr = spatial_derivatives[8] + j*BLOCK_X + k*BLOCK_Y*NX;
    params->dstPtr.ptr = d_Iz2;
    CUDA_SAFE_CALL( cudaMemcpy3DAsync(params, *stream) );

Many thanks


If anyone can post an example of cudaMemcpy3D that uses the offsets in the parameters structure, I would appreciate it. They are ignored when I try using them.
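For what it is worth, here is a minimal sketch of a cudaMemcpy3D call that uses the srcPos offset instead of pointer arithmetic. It assumes a reasonably recent CUDA runtime that accepts pointer-to-pointer 3D copies, and the sizes (NX, NY, NZ, BX, BY, BZ) and the function name copy_block_with_src_pos are made up for illustration. One thing that may explain offsets appearing to be ignored or wrong: as far as I understand the runtime documentation, when the source or destination is a plain pointer rather than a CUDA array, the x component of the position and the width of the extent are counted in bytes, not in elements.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical sizes, chosen only for this sketch.
const int NX = 64, NY = 32, NZ = 16;  // full host volume, in floats
const int BX = 16, BY = 8,  BZ = 4;   // sub-block to copy, in floats

// Copy the BX x BY x BZ block whose first element is (x0, y0, z0) from a
// host volume into a tightly packed device buffer, read it back, and verify.
bool copy_block_with_src_pos(int x0, int y0, int z0)
{
    std::vector<float> h_vol((size_t)NX * NY * NZ);
    for (size_t i = 0; i < h_vol.size(); ++i) h_vol[i] = (float)i;

    const size_t block_bytes = (size_t)BX * BY * BZ * sizeof(float);
    float *d_block = NULL;
    if (cudaMalloc((void **)&d_block, block_bytes) != cudaSuccess) return false;

    cudaMemcpy3DParms p = {0};  // fields that are not set must be zero
    // Pitch is the row length in bytes; the last two arguments describe the
    // row width and the number of rows per slice of each allocation.
    p.srcPtr = make_cudaPitchedPtr(h_vol.data(), NX * sizeof(float), NX, NY);
    p.dstPtr = make_cudaPitchedPtr(d_block, BX * sizeof(float), BX, BY);
    // For pointer (non-array) memory, the x offset and the extent width are
    // given in bytes; y and z are counted in rows and slices.
    p.srcPos = make_cudaPos(x0 * sizeof(float), y0, z0);
    p.extent = make_cudaExtent(BX * sizeof(float), BY, BZ);
    p.kind   = cudaMemcpyHostToDevice;

    std::vector<float> h_out((size_t)BX * BY * BZ, -1.0f);
    bool ok = cudaMemcpy3D(&p) == cudaSuccess &&
              cudaMemcpy(h_out.data(), d_block, block_bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_block);

    // Every copied element must match the source element at the offset.
    for (int z = 0; ok && z < BZ; ++z)
        for (int y = 0; ok && y < BY; ++y)
            for (int x = 0; ok && x < BX; ++x)
                ok = h_out[(z * BY + y) * BX + x] ==
                     h_vol[((size_t)(z0 + z) * NY + (y0 + y)) * NX + (x0 + x)];
    return ok;
}
```

Calling copy_block_with_src_pos(16, 8, 4) from main should return true if the offsets are honoured. The same parameters structure can be passed to cudaMemcpy3DAsync together with a stream.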

I can confirm that I have this exact same problem. cudaMemcpy3DAsync fails, where cudaMemcpy3D runs just fine.

cudaMemcpyToArrayAsync works just fine.

This is on CUDA 2.0, with driver 178.28 and Visual Studio 2005 Express, on 32-bit Windows XP with a GTX 280.


In the end, it turned out that the problem was caused by my not using page-locked memory for the Async version (it is not a requirement for the standard one). This memory is allocated with cudaMallocHost. It also has to be allocated in the thread in which it is going to be used; it cannot be shared between threads by passing pointers.
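In case it is useful, the fix can be sketched roughly as follows. This is only an illustration under assumptions, not my actual code: the sizes, the CHECK macro and the function name async_block_copy are invented, and only NX, BLOCK_X, BLOCK_Y, j and k echo the names in the snippet above. The host buffers come from cudaMallocHost in the same thread that issues the copy, and the result is only read after the stream has been synchronized. Newer runtimes may behave differently with pageable memory, so treat the failure described here as specific to the setup in this thread.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: report the failing call and bail out.
// Sketch only: resources are not released on the error path.
#define CHECK(call)                                                       \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(err_)); \
            return false;                                                 \
        }                                                                 \
    } while (0)

const int NX = 64, NY = 32, NZ = 8;   // hypothetical volume size, in floats
const int BLOCK_X = 16, BLOCK_Y = 8;  // hypothetical block size, in floats

// Asynchronously copy block (j, k) of a pinned host volume to the device,
// copy it back on the same stream, and verify the contents.
bool async_block_copy(int j, int k)
{
    const size_t vol_elems = (size_t)NX * NY * NZ;
    const size_t blk_elems = (size_t)BLOCK_X * BLOCK_Y * NZ;
    float *h_vol = NULL, *h_out = NULL, *d_blk = NULL;
    cudaStream_t stream;

    // Page-locked host buffers, allocated in the thread that uses them.
    CHECK(cudaMallocHost((void **)&h_vol, vol_elems * sizeof(float)));
    CHECK(cudaMallocHost((void **)&h_out, blk_elems * sizeof(float)));
    CHECK(cudaMalloc((void **)&d_blk, blk_elems * sizeof(float)));
    CHECK(cudaStreamCreate(&stream));
    for (size_t i = 0; i < vol_elems; ++i) h_vol[i] = (float)i;

    cudaMemcpy3DParms p = {0};  // fields that are not set must be zero
    // Source pointer advanced to the block, as in the original snippet.
    p.srcPtr = make_cudaPitchedPtr(h_vol + j * BLOCK_X + k * BLOCK_Y * NX,
                                   NX * sizeof(float), NX, NY);
    p.dstPtr = make_cudaPitchedPtr(d_blk, BLOCK_X * sizeof(float),
                                   BLOCK_X, BLOCK_Y);
    p.extent = make_cudaExtent(BLOCK_X * sizeof(float), BLOCK_Y, NZ);
    p.kind   = cudaMemcpyHostToDevice;

    CHECK(cudaMemcpy3DAsync(&p, stream));
    CHECK(cudaMemcpyAsync(h_out, d_blk, blk_elems * sizeof(float),
                          cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));  // data is valid only after this

    bool ok = true;
    for (int z = 0; ok && z < NZ; ++z)
        for (int y = 0; ok && y < BLOCK_Y; ++y)
            for (int x = 0; ok && x < BLOCK_X; ++x)
                ok = h_out[(z * BLOCK_Y + y) * BLOCK_X + x] ==
                     h_vol[((size_t)z * NY + k * BLOCK_Y + y) * NX
                           + j * BLOCK_X + x];

    cudaStreamDestroy(stream);
    cudaFree(d_blk);
    cudaFreeHost(h_out);
    cudaFreeHost(h_vol);
    return ok;
}
```

With these sizes, j and k can each range from 0 to 3, and async_block_copy(j, k) should return true for any of those blocks. If the buffers have to be shared between host threads, that is a separate question that this sketch does not try to address.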

Hope this helps