Avoiding a device write using textures and arrays.

Hello,

I have written a bunch of CUDA kernels for some of the basic mathematical operations that I run in my C simulations. The specific kernels I have a question about perform matrix operations. For each of these I've used either texture memory access or a linear array in global memory (with coalesced access), depending on which technique made the given kernel faster.

In addition, on the host side, my matrices and vectors are stored in structures that keep track of whether the data has been modified on the GPU or the host side, so the code only allocates and memcpys the data to the GPU (or back to the host) when the logic says it is out of date. Of course, this kind of thing is necessary to avoid the massive overhead involved in copying data back and forth between the GPU and the host.
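Schematically, the bookkeeping looks something like this (a simplified sketch of the idea, not my exact code; the field and function names are just illustrative):

typedef struct {
    float  *data;             /* host copy of the matrix                 */
    size_t  size1, size2;     /* rows, columns                           */
    float  *devPtr;           /* linear (pitched) copy in global memory  */
    void   *devPtr2;          /* cudaArray backing the texture           */
    int     hostDirty;        /* host copy modified since last sync      */
    int     devDirty;         /* device copy modified since last sync    */
    int     keepOnDev;        /* leave the result on the GPU             */
    int     hostAndDevSynced;
} gpu_matrix;

/* Only pay for a host->device copy when the device copy is stale. */
static void sync_to_device(gpu_matrix *A, size_t pitch)
{
    if (A->hostDirty) {
        CUDA_SAFE_CALL( cudaMemcpy2D(A->devPtr, pitch,
                                     A->data, A->size2 * sizeof(float),
                                     A->size2 * sizeof(float), A->size1,
                                     cudaMemcpyHostToDevice) );
        A->hostDirty = 0;
        A->hostAndDevSynced = 1;
    }
}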

So now I’ve run into a small practical problem that I would like to solve. Since some of my routines use textures and some use linear arrays in global memory, I have to keep these two pieces of data on the GPU in sync whenever I update one or the other.

E.g., here are two operations that happen on the GPU:

A = A + rr^T (A is updated in linear memory)

k = A*r (this kernel reads A through a texture)
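For concreteness, the two access patterns look roughly like this (simplified sketches, not my real kernels; I've left out the float4 packing and use a plain float texture here):

texture<float, 2, cudaReadModeElementType> texRefA;

/* A = A + r*r^T: updates A in place through a linear global pointer. */
__global__ void syr_kernel(float *A, size_t pitchElems, const float *r, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n)
        A[row * pitchElems + col] += r[row] * r[col];
}

/* k = A*r: reads A through the 2D texture instead. */
__global__ void matvec_kernel(float *k, const float *r, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float sum = 0.0f;
    for (int col = 0; col < n; ++col)
        sum += tex2D(texRefA, (float)col, (float)row) * r[col];
    k[row] = sum;
}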

So you can see that I have to keep A in two formats, which is wasteful in both space and time. Is there any way to avoid this while still using kernels that read matrices from both global memory and texture representations (for the sake of kernel speed)? I.e., is it possible to point the texture reference at the linear memory representation, or is there any other such hack? This is what I normally have:

A is the structure that holds the matrix, and A->data is the host-side data.

    if ( alloc_A_on_device )
    {
        /* Linear representation: a pitched 2D allocation in global memory. */
        CUDA_SAFE_CALL( cudaMallocPitch((void**) &A->devPtr, &pitch,
                                        A->size2 * sizeof(float), A->size1) );

        /* Texture representation: fold the matrix into a float4 (4x1) texture,
           so each texel packs four consecutive floats of a row.
           channelDescA is the float4 channel descriptor, created elsewhere. */
        CUDA_SAFE_CALL( cudaMallocArray((cudaArray**) &A->devPtr2, &channelDescA,
                                        A->size2 / 4, A->size1) );
    }

    if ( load_A_to_device )
    {
        /* Host -> device: use a 2D copy so the destination pitch is respected
           (a flat cudaMemcpy is only safe if pitch == row width). */
        CUDA_SAFE_CALL( cudaMemcpy2D(A->devPtr, pitch,
                                     A->data, A->size2 * sizeof(float),
                                     A->size2 * sizeof(float), A->size1,
                                     cudaMemcpyHostToDevice) );

        /* Mirror the linear copy into the cudaArray that backs the texture. */
        CUDA_SAFE_CALL( cudaMemcpy2DToArray((cudaArray*) A->devPtr2, 0, 0,
                                            A->devPtr, pitch,
                                            A->size2 * sizeof(float), A->size1,
                                            cudaMemcpyDeviceToDevice) );

        A->hostAndDevSynced = true;
    }

    CUDA_SAFE_CALL( cudaBindTextureToArray(texRefA, (cudaArray*) A->devPtr2, channelDescA) );

<<<kernel>>>

    if ( A->keepOnDev )
    {
        /* The kernel updated the linear copy, so re-sync the texture array. */
        CUDA_SAFE_CALL( cudaMemcpy2DToArray((cudaArray*) A->devPtr2, 0, 0,
                                            A->devPtr, pitch,
                                            A->size2 * sizeof(float), A->size1,
                                            cudaMemcpyDeviceToDevice) );
    }

I would especially like to avoid that last cudaMemcpy2DToArray, which re-syncs the texture array with the freshly updated global memory representation of A.

Thank you,

-David

Well, you can bind 1D textures straight to device memory (cudaBindTexture, tex1Dfetch), but then you lose the benefit of the 2D cache locality, which I'm guessing you need for your matrices.
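Something like this (quick sketch; the texture reference and kernel are illustrative):

texture<float, 1, cudaReadModeElementType> texRef1D;

/* k = A*r, reading A through the 1D texture cache via a flattened index. */
__global__ void matvec_tex1d(float *k, const float *r, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float sum = 0.0f;
    for (int col = 0; col < n; ++col)
        sum += tex1Dfetch(texRef1D, row * n + col) * r[col];
    k[row] = sum;
}

/* Host side: bind the texture straight to the linear allocation --
   no cudaArray, no device-to-device sync copy, but only the 1D cache. */
float *devA;
CUDA_SAFE_CALL( cudaMalloc((void**) &devA, n * n * sizeof(float)) );
CUDA_SAFE_CALL( cudaBindTexture(0, texRef1D, devA, n * n * sizeof(float)) );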

There is unfortunately no way to avoid your cudaMemcpy2DToArray. How slow is it for you? Or is it the wasted memory that is the problem? CUDA 1.1 has a bug that makes cudaMemcpyToArray very slow. The performance improved in 2.0b1, but not hugely. NVIDIA said they were working on improving it more, but I haven’t checked it yet in 2.0b2.

As MisterAnderson42 says, it is not possible to view the same data as both a 2D texture and a linear array. Fundamentally, this is because the data backing a 2D texture is not stored in a linear layout.

It has always seemed strange to me that NVIDIA refuses to disclose the details of the data layout in 2D textures. Presumably it's easy enough to determine by writing a known pattern to a texture and reading it back through global pointers in CUDA? Unless, of course, it's impossible to even find the starting address, but even then it might be possible to write some "magic" values and just search the entire device memory space.

However, I can’t really see it being worth the effort in most cases.

It’s the time, although I’m sure I’d like to save the space, too.

For a 2048x2048 A in A = A + rr^T:

                                total func time (ms)   kernel time (ms)
with the cudaMemcpy2DToArray:         4.911000              1.540000
without it:                           1.534000              1.532000

So the memcpy really has to be taken into account when I try to figure out whether kernels are "faster" with textures or with plain global memory.
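For reference, those numbers come from event-based timing around the call, roughly like this (sketch; the event names, grid/block, and kernel arguments are just illustrative):

cudaEvent_t t0, t1, t2;
CUDA_SAFE_CALL( cudaEventCreate(&t0) );
CUDA_SAFE_CALL( cudaEventCreate(&t1) );
CUDA_SAFE_CALL( cudaEventCreate(&t2) );

CUDA_SAFE_CALL( cudaEventRecord(t0, 0) );
syr_kernel<<<grid, block>>>(A->devPtr, pitchElems, r->devPtr, n);
CUDA_SAFE_CALL( cudaEventRecord(t1, 0) );
/* the sync step I'd like to eliminate */
CUDA_SAFE_CALL( cudaMemcpy2DToArray((cudaArray*) A->devPtr2, 0, 0,
                                    A->devPtr, pitch,
                                    A->size2 * sizeof(float), A->size1,
                                    cudaMemcpyDeviceToDevice) );
CUDA_SAFE_CALL( cudaEventRecord(t2, 0) );
CUDA_SAFE_CALL( cudaEventSynchronize(t2) );

float kernelMs, totalMs;
CUDA_SAFE_CALL( cudaEventElapsedTime(&kernelMs, t0, t1) );
CUDA_SAFE_CALL( cudaEventElapsedTime(&totalMs,  t0, t2) );
printf("SYR: Processing time: %f (ms) : %f (ms)\n", totalMs, kernelMs);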

Thanks,

-David