Hello,
I have written a bunch of CUDA kernels for the basic mathematical operations in my C simulations. The kernels in question perform matrix operations. For each kernel I access the matrix either through texture memory or as a linear array in global memory (with coalesced accesses), whichever technique made that particular kernel faster.
In addition, on the host side, my matrices and vectors are stored in structures that track whether the data has been modified on the GPU or the host, so I only allocate and memcpy the data to the GPU (or back to the host) when the logic says it is out of date. This kind of bookkeeping is necessary to avoid the massive overhead of copying data back and forth between the GPU and the host.
So now I’ve run into a small practical problem that I would like to solve. Since some of my routines use textures and some use linear arrays in global memory, I have to keep these two pieces of data on the GPU in sync whenever one or the other is updated.
For example, here are two operations that happen on the GPU:
A = A + rr^T (A is updated in linear memory)
k = A*r (this kernel uses A as a texture representation)
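For reference, the k = A*r kernel reads A through the texture; roughly like this (a simplified sketch using the legacy texture-reference API, with illustrative names; texRefA is the float4 texture bound in the code below):

```
/* Sketch of the k = A*r kernel: each thread computes one entry of k,
   reading a row of A through the float4 texture. */
texture<float4, 2, cudaReadModeElementType> texRefA;

__global__ void matvec_tex(float *k, const float *r, int cols4)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int j = 0; j < cols4; ++j) {            /* cols4 = size2 / 4 */
        float4 a = tex2D(texRefA, j, row);
        sum += a.x * r[4*j]   + a.y * r[4*j+1]
             + a.z * r[4*j+2] + a.w * r[4*j+3];
    }
    k[row] = sum;
}
```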
So you can see that I have to keep A in two formats, which is wasteful in both space and time. Is there any way to avoid this while still having kernels that use both the global memory and the texture representations of matrices (for the sake of kernel speed)? I.e., is it possible to point the texture reference at the linear memory representation, or some other such hack? This is what I normally have:
A is the structure that holds the matrix, and A->data is the matrix data on the host.
if ( alloc_A_on_device )
{
    /* Linear (pitched) representation. */
    CUDA_SAFE_CALL( cudaMallocPitch((void**)&A->devPtr, &pitch,
                                    A->size2*sizeof(float), A->size1) );
    /* Texture representation: fold the matrix into a float4 array. */
    CUDA_SAFE_CALL( cudaMallocArray((cudaArray**)&A->devPtr2, &channelDescA,
                                    A->size2 / 4, A->size1) );
}
if ( load_A_to_device )
{
    /* 2D copy so the row pitch returned by cudaMallocPitch is respected. */
    CUDA_SAFE_CALL( cudaMemcpy2D(A->devPtr, pitch, A->data,
                                 A->size2*sizeof(float),
                                 A->size2*sizeof(float), A->size1,
                                 cudaMemcpyHostToDevice) );
    CUDA_SAFE_CALL( cudaMemcpy2DToArray((cudaArray*)A->devPtr2, 0, 0,
                                        A->devPtr, pitch,
                                        A->size2*sizeof(float), A->size1,
                                        cudaMemcpyDeviceToDevice) );
    A->hostAndDevSynced = true;
}
CUDA_SAFE_CALL( cudaBindTextureToArray(texRefA, (cudaArray*)A->devPtr2, channelDescA) );
kernel<<<grid, block>>>( ... );
if ( A->keepOnDev )
{
    /* Re-sync the texture array with the linear copy the kernel just updated. */
    CUDA_SAFE_CALL( cudaMemcpy2DToArray((cudaArray*)A->devPtr2, 0, 0,
                                        A->devPtr, pitch,
                                        A->size2*sizeof(float), A->size1,
                                        cudaMemcpyDeviceToDevice) );
}
So I’d especially like to avoid that last cudaMemcpy2DToArray, which re-syncs the texture array with the freshly updated global memory copy of A.
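What I’m hoping is possible is something along these lines, binding the texture reference straight to the pitched linear allocation so the second copy disappears entirely (a sketch only; I don’t know whether this is legal or whether it keeps the texture-cache benefit):

```
/* Hoped-for alternative: bind a 1D texture reference directly to the
   pitched linear memory instead of keeping a separate cudaArray.
   Sketch only -- I'm not sure this is valid or fast. */
texture<float4, 1, cudaReadModeElementType> texRefA1D;

size_t offset = 0;
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
CUDA_SAFE_CALL( cudaBindTexture(&offset, texRefA1D, A->devPtr, desc,
                                A->size1 * pitch) );
/* Kernels would then read A with tex1Dfetch(texRefA1D, i) instead of tex2D. */
```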
Thank you,
-David