Hello,
I have written a bunch of CUDA kernels for the basic mathematical operations in my C simulations. The kernels in question perform matrix operations. For each kernel I access the matrix either through texture memory or as a linear array in global memory (with coalesced accesses), whichever technique made that particular kernel faster.
In addition, on the host side, my matrices and vectors are stored in structures that track whether the data has been modified on the GPU or the host, so I only allocate and memcpy the data to the GPU (or back to the host) when the logic says it is out of date. This kind of bookkeeping is necessary to avoid the massive overhead of copying data back and forth between the GPU and the host.
So now I’ve run into a small practical problem that I would like to solve. Since some of my routines use textures and some use linear arrays in global memory, I have to keep these two pieces of data on the GPU in sync whenever one or the other is updated.
For example, here are two operations that happen on the GPU:
A = A + rr^T (A is updated in linear memory)
k = A*r (this kernel uses A as a texture representation)
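For reference, the k = A*r kernel reads A through the texture; roughly like this (a simplified sketch using the legacy texture-reference API, with illustrative names; texRefA is the float4 texture bound in the code below):

```
/* Sketch of the k = A*r kernel: each thread computes one entry of k,
   reading a row of A through the float4 texture. */
texture<float4, 2, cudaReadModeElementType> texRefA;

__global__ void matvec_tex(float *k, const float *r, int cols4)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int j = 0; j < cols4; ++j) {            /* cols4 = size2 / 4 */
        float4 a = tex2D(texRefA, j, row);
        sum += a.x * r[4*j]   + a.y * r[4*j+1]
             + a.z * r[4*j+2] + a.w * r[4*j+3];
    }
    k[row] = sum;
}
```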
So you can see that I have to keep A in two formats, which is wasteful in both space and time. Is there any way to avoid this while still having kernels that use both the global memory and the texture representations of matrices (for the sake of kernel speed)? I.e., is it possible to point the texture reference at the linear memory representation, or some other such hack? This is what I normally have:
A is the structure that holds the matrix, and A->data is the matrix data on the host.
if ( alloc_A_on_device )
{
    /* Linear (pitched) representation. */
    CUDA_SAFE_CALL( cudaMallocPitch((void**)&A->devPtr, &pitch,
                                    A->size2*sizeof(float), A->size1) );
    /* Texture representation: fold the matrix into a float4 array. */
    CUDA_SAFE_CALL( cudaMallocArray((cudaArray**)&A->devPtr2, &channelDescA,
                                    A->size2 / 4, A->size1) );
}
if ( load_A_to_device )
{
    /* 2D copy so the row pitch returned by cudaMallocPitch is respected. */
    CUDA_SAFE_CALL( cudaMemcpy2D(A->devPtr, pitch, A->data,
                                 A->size2*sizeof(float),
                                 A->size2*sizeof(float), A->size1,
                                 cudaMemcpyHostToDevice) );
    CUDA_SAFE_CALL( cudaMemcpy2DToArray((cudaArray*)A->devPtr2, 0, 0,
                                        A->devPtr, pitch,
                                        A->size2*sizeof(float), A->size1,
                                        cudaMemcpyDeviceToDevice) );
    A->hostAndDevSynced = true;
}
CUDA_SAFE_CALL( cudaBindTextureToArray(texRefA, (cudaArray*)A->devPtr2, channelDescA) );
kernel<<<grid, block>>>( ... );
if ( A->keepOnDev )
{
    /* Re-sync the texture array with the linear copy the kernel just updated. */
    CUDA_SAFE_CALL( cudaMemcpy2DToArray((cudaArray*)A->devPtr2, 0, 0,
                                        A->devPtr, pitch,
                                        A->size2*sizeof(float), A->size1,
                                        cudaMemcpyDeviceToDevice) );
}
So I’d especially like to avoid that last cudaMemcpy2DToArray, which re-syncs the texture array with the freshly updated global memory copy of A.
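What I’m hoping is possible is something along these lines, binding the texture reference straight to the pitched linear allocation so the second copy disappears entirely (a sketch only; I don’t know whether this is legal or whether it keeps the texture-cache benefit):

```
/* Hoped-for alternative: bind a 1D texture reference directly to the
   pitched linear memory instead of keeping a separate cudaArray.
   Sketch only -- I'm not sure this is valid or fast. */
texture<float4, 1, cudaReadModeElementType> texRefA1D;

size_t offset = 0;
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
CUDA_SAFE_CALL( cudaBindTexture(&offset, texRefA1D, A->devPtr, desc,
                                A->size1 * pitch) );
/* Kernels would then read A with tex1Dfetch(texRefA1D, i) instead of tex2D. */
```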
Thank you,
-David