Filter methods using texture memory

I’ve been researching how to optimize filter methods such as the Sobel filter. Initially, I believed the most efficient approach would be to store the filter mask in constant memory and the image data in shared memory. However, upon examining the code in the linked GitHub repository, it appears to copy data from the host to texture memory, and then from texture memory to shared memory, before performing the computation.

This has raised a question: which approach is preferable for implementing filters like Sobel?

  1. Uploading data from the host to texture memory and then to shared memory.
  2. Uploading data from the host to global memory and then to shared memory.

I’m seeking guidance on the most effective approach to take for implementing image processing filters, such as the Sobel filter, in CUDA.

Did you try them both? Which method was faster?

Texture memory is (mainly) a caching system. If you are only reading the data once, in an orderly fashion, then texture (or any cache) is unlikely to provide benefit.

Of course, many filtering systems do involve multiple reads of the same data item. Therefore, caches can be a benefit. But the shared memory approach used with many filtering methods provides the primary caching benefit for data reuse.

Therefore, I would study the code. If getting the data into shared memory involves only one read/load per data item, and all subsequent accesses are serviced entirely from shared memory, then any caching system is unlikely to provide benefit.

Many shared-memory filtering methods involve a border (halo) region, which often must be loaded into shared memory with a second load operation. The load from global memory may therefore read some data twice, so a cache of some sort may be of some benefit. However, modern CUDA GPUs include an L2 cache, so the benefit of the texture “cache” over the “ordinary” L2 cache may be quite limited.
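As a concrete illustration of that halo issue, here is a minimal sketch of a 3×3 shared-memory filter tile (not taken from the linked repository; the `TILE` size, the edge-clamping policy, and the box-blur weights standing in for a Sobel mask are all my assumptions):

```cuda
#define TILE 16

__global__ void filter3x3(const float *in, float *out, int width, int height)
{
    // tile plus a 1-pixel halo on each side: interior pixels are loaded once,
    // but halo pixels require extra loads -- the "2 loads" mentioned above
    __shared__ float smem[TILE + 2][TILE + 2];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // cooperative load of the (TILE+2) x (TILE+2) region, clamping at edges
    for (int dy = threadIdx.y; dy < TILE + 2; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2; dx += TILE) {
            int gx = min(max(blockIdx.x * TILE + dx - 1, 0), width - 1);
            int gy = min(max(blockIdx.y * TILE + dy - 1, 0), height - 1);
            smem[dy][dx] = in[gy * width + gx];
        }
    __syncthreads();

    if (x < width && y < height) {
        // all nine reads per output pixel now come from shared memory,
        // so global memory (and any cache behind it) is touched only once
        float sum = 0.0f;
        for (int j = 0; j < 3; ++j)
            for (int i = 0; i < 3; ++i)
                sum += smem[threadIdx.y + j][threadIdx.x + i] * (1.0f / 9.0f);
        out[y * width + x] = sum;
    }
}
```

The point of the sketch: only the duplicated halo reads go to global memory more than once, which is the narrow window where a texture cache (or L2) could help at all.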

The CUDA samples serve several purposes, but the primary purpose is often not to answer the question “what is the fastest way to do X”, but instead, to answer a question like “how could I make sensible use of the texture system as a cache for a filtering application”. Answering the second question does not necessarily imply that texture is faster than ordinary L2 caching in all cases.

I haven’t directly experimented with filters, but I did conduct some experiments related to them. Here’s the code:

#include <cuda_runtime.h>
#include <cstdlib>   // std::malloc / std::free
#include <cstring>   // std::memset
#include <iostream>

__global__ void transformKernelTexture(float *output, cudaTextureObject_t texObj, int width, int height, float theta)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= (unsigned int)width || y >= (unsigned int)height)
        return;  // guard against partial blocks
    // normalized coordinates, rotated by theta about the image center
    float u = x / static_cast<float>(width) - 0.5f;
    float v = y / static_cast<float>(height) - 0.5f;
    float tu = u * cosf(theta) - v * sinf(theta) + 0.5f;
    float tv = v * cosf(theta) + u * sinf(theta) + 0.5f;
    output[y * width + x] = tex2D<float>(texObj, tu, tv);  // hardware-filtered texture read
}

__global__ void transformKernelNoTexture(float *output, int width, int height, float theta)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= (unsigned int)width || y >= (unsigned int)height)
        return;  // guard against partial blocks
    // same coordinate arithmetic, but no memory read: output is just tu * tv
    float u = x / static_cast<float>(width) - 0.5f;
    float v = y / static_cast<float>(height) - 0.5f;
    float tu = u * cosf(theta) - v * sinf(theta) + 0.5f;
    float tv = v * cosf(theta) + u * sinf(theta) + 0.5f;
    output[y * width + x] = tu * tv;
}

int main()
{
    const int height = 1024;
    const int width = 1024;
    float angle = 0.5f;
    float *h_data = static_cast<float *>(std::malloc(sizeof(float) * width * height));

    for (int i = 0; i < height * width; ++i)
    {
        h_data[i] = static_cast<float>(i);
    }

    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray_t cuArray;
    cudaMallocArray(&cuArray, &channelDesc, width, height);
    const size_t spitch = width * sizeof(float);
    cudaMemcpy2DToArray(cuArray, 0, 0, h_data, spitch, width * sizeof(float), height, cudaMemcpyHostToDevice);
    struct cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cuArray;

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = cudaAddressModeWrap;
    texDesc.addressMode[1] = cudaAddressModeWrap;
    texDesc.filterMode = cudaFilterModeLinear;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 1;
    cudaTextureObject_t texObj = 0;
    cudaCreateTextureObject(&texObj, &resDesc, &texDesc, nullptr);

    float *outputTexture, *outputNoTexture;

    cudaMalloc(&outputTexture, width * height * sizeof(float));
    cudaMalloc(&outputNoTexture, width * height * sizeof(float));

    dim3 threadsperBlock(32, 32);
    dim3 numBlocks((width + threadsperBlock.x - 1) / threadsperBlock.x, (height + threadsperBlock.y - 1) / threadsperBlock.y);

    transformKernelTexture<<<numBlocks, threadsperBlock>>>(outputTexture, texObj, width, height, angle);

    transformKernelNoTexture<<<numBlocks, threadsperBlock>>>(outputNoTexture, width, height, angle);

    cudaDeviceSynchronize();  // ensure both kernels have finished before releasing resources

    cudaDestroyTextureObject(texObj);
    cudaFreeArray(cuArray);
    cudaFree(outputTexture);
    cudaFree(outputNoTexture);
    std::free(h_data);

    return 0;
}

Profiling with Nsight Systems and Nsight Compute, I observed that the method without textures is faster. What perplexes me, however, is the prevalent use of textures in filtering code. Using Nsight Systems, I saw that OpenCV’s CUDA path in C++ employs functions like cudaMallocPitch and cudaMemcpy2D, and CUVI appears to do something similar, judging by its “Pitch”-related parameters.
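For reference, the pitched-allocation pattern observed in those libraries looks roughly like this (a minimal sketch of the cudaMallocPitch / cudaMemcpy2D calls, not OpenCV’s or CUVI’s actual code; the image size is an arbitrary example). Note that pitched memory is plain global memory with padded rows for aligned accesses; it does not by itself imply texture usage:

```cuda
#include <cuda_runtime.h>

int main()
{
    const int width = 1024, height = 1024;
    float *h_img = new float[width * height]();  // tightly packed host buffer

    float *d_img;
    size_t pitch;  // row stride in bytes, padded so each row starts aligned
    cudaMallocPitch(&d_img, &pitch, width * sizeof(float), height);

    // the copy must honor the device pitch, hence cudaMemcpy2D
    cudaMemcpy2D(d_img, pitch, h_img, width * sizeof(float),
                 width * sizeof(float), height, cudaMemcpyHostToDevice);

    // inside a kernel, row y would start at (float *)((char *)d_img + y * pitch)

    cudaFree(d_img);
    delete[] h_img;
    return 0;
}
```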

This prompts me to wonder whether textures possess a significantly greater performance advantage that I might not be aware of.