3D textures and memory writes: writing to memory bound to a 3D texture

I am picking up a topic posted in October 2009 that received no answers (sphyerion, post of Oct 28 2009).

I want to use a 3D texture (accessed via tex3D()) and also write to the underlying data array in a separate kernel.
The reason is that the arrays are huge (~1 GB): I cannot afford a second copy just for device-to-device copying, and would otherwise have to make piecewise copies (possible, but nasty code).

In the “Best Practices Guide 3.1” I find the following statement:

“That is, a thread can safely read a memory location via texture if the location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread within the same kernel call. This is relevant only when fetching from linear or pitch-linear memory because a kernel cannot write to CUDA arrays.”

However, it seems I cannot bind a 3D texture to anything other than a CUDA array (i.e., memory allocated with cudaMalloc3DArray() rather than cudaMalloc3D()).
Is this really a restriction of 3D textures, or am I missing something?

Rolf

You’re correct, currently you can only bind 3D textures to 3D arrays, and you can’t write to 3D arrays from kernels.

CUDA 3.1 adds surfaces, which allow writing to 2D textures (at least on Fermi hardware), but this doesn’t support 3D textures yet. 3D surface writes will be added in a future release.
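For completeness, a surface write on 2D arrays looks roughly like this (a sketch only, assuming Fermi hardware and CUDA 3.1; outSurf, cuArray, width and height are placeholder names):

```cuda
// Sketch: writing to a 2D cudaArray via a surface reference (Fermi, CUDA 3.1).
// outSurf and cuArray are hypothetical names, not from the original post.
surface<void, 2> outSurf;   // surface reference, declared at file scope

__global__ void fillKernel(int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float v = (float)(y * width + x);
        // note: the x coordinate for surface writes is given in bytes
        surf2Dwrite(v, outSurf, x * sizeof(float), y);
    }
}

// Host side: the array must be created with the surface load/store flag,
// e.g. cudaMallocArray(&cuArray, &desc, width, height, cudaArraySurfaceLoadStore);
// then bound with cudaBindSurfaceToArray(outSurf, cuArray);
```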

So the only solution today is to either:

  • write to global memory and then use cudaMemcpy3D() to copy to the array
  • use global loads and do your own interpolation (perhaps using tex1Dfetch() to cache the reads).
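The first workaround above can be sketched roughly as follows (a sketch only; nx/ny/nz, fillKernel and cuArray are placeholder names, and the kernel launch is elided):

```cuda
// Sketch: a kernel writes into linear (pitched) device memory, which is then
// copied device-to-device into the cudaArray the 3D texture is bound to.
cudaPitchedPtr devBuf;
cudaExtent bufExtent = make_cudaExtent(nx * sizeof(float), ny, nz); // width in bytes
cudaMalloc3D(&devBuf, bufExtent);

// ... launch a kernel that writes into devBuf.ptr (row pitch devBuf.pitch) ...

cudaMemcpy3DParms p = {0};
p.srcPtr   = devBuf;
p.dstArray = cuArray;                   // the array the 3D texture is bound to
// when a cudaArray participates in the copy, the extent is given in elements
p.extent   = make_cudaExtent(nx, ny, nz);
p.kind     = cudaMemcpyDeviceToDevice;
cudaMemcpy3D(&p);
```

This costs one extra transfer per update, but avoids holding a permanent second copy of the full data set.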


thanks for the clarification.

Actually, I was not 100% clear in my original post: I need both read access (via textures) and write access to the data in different kernels, and would have liked to allocate just one memory space.

In addition to what Simon wrote:

You can very well copy data directly from the host to a cudaArray on the device:

    // describe source (pitch-linear host memory) and destination (CUDA array)
    cudaMemcpy3DParms aParms = {0};
    aParms.srcPtr   = make_cudaPitchedPtr((float *)image, IPar->dim.x * sizeof(float),
                                          IPar->dim.x, IPar->dim.y);
    aParms.dstArray = dtexImg3DData;   // the cudaArray the 3D texture is bound to
    aParms.extent   = imgExtent;       // in elements, since a cudaArray is involved
    aParms.kind     = cudaMemcpyHostToDevice;
    CUDA_SAFE_CALL(cudaMemcpy3D(&aParms));

So for pure copying there is no need for additional memory. Unfortunately, however, this does not fully solve my problem.