Texture memory fetch extremely slow

So I have a 4000x3000 image stored in texture memory as an unsigned short. When I access any of the data it slows down my kernel by ~7ms, even if I am not doing anything with the data.

I am accessing the data via:

ushort result = tex2D(texInImage, x, y);

Where

texture<ushort, 2, cudaReadModeElementType> texInImage;

Suggestions about how to debug this slowdown would be greatly apprecated.

Let me guess. You’re arriving at this conclusion based on commenting out various lines of code.

Yeah I am, which I am guessing is the wrong way to go about it.

The problem is the compiler is pretty smart about optimization. When you delete lines of code which directly or indirectly affect global state, it often means that other lines of code are no longer doing anything that impacts global state, and the compiler will detect that and delete that code as well. These unexpected/unintended side effects make the approach fraught with difficulty.

Many folks don’t realize the extent/capability of current optimizing compilers. It’s sometimes not possible to anticipate compiler behavior by looking at source code.

The best approach for optimization (IMO) is analysis-driven optimization. In a nutshell, this means using tools like profilers:

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#abstract

to identify performance issues with the code and then targetting fixes to address those issues. There’s definitely a learning curve associated with this. You can find presentations on this topic by googling, for example “gtc analysis driven optimization”

If you have something that you can reduce to a relatively short (but complete) example, you can probably get help here as well. However there’s not much that can be said if you only show one line of source code. Your conclusion that that line is the crux of the issue is likely not sound, IMO.

If you really want to validate the “commenting out” method, you really must analyze the impact on the actual generated machine code. This can be done using the CUDA binary utilities:

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#abstract

but it’s somewhat tedious and has a learning curve associated with it as well.

I am assuming you are doing release builds since you know that measuring performance with a debug build is meaningless. The CUDA compiler is very effective in removing “dead” code. If, for example, you remove the storing of data to global memory, the compiler will optimize out all computation that previously fed into to stored data, unless that computation contributes to other data that is still being stored.

Probably the easiest way to assess how efficiently your kernel is using the available memory bandwidth is to use the Visual Profiler. It will also give a good indication what the bottlenecks are in your kernel, it may not even be the texture reads. Since you haven’t shown any code, it is impossible for us to know. The Best Practices Guide advises on strategies on how to maximize memory throughput in chapter 9.

I have not tried this with texture accesses, but one thing you might want to look into is to use wider loads, for example by mapping the texture as ushort4 (your texture dimensions suggest that this should be possible). Obviously that means that you would need to adjust your code to process four texels at a time, which may or may not be a trivial change. Also make sure that your accesses have the best possible locality, you may be able to adjust your traversal pattern to optimize this.

Lastly, you always want to set achieved bandwidth in relation to the maximum practically achievable bandwidth, which you can assume to be about 75% of the theoretical peak bandwidth. The bandwidth of low-end and high-end cards can differ by a factor of about 5, so that is something to keep in mind.

Thanks for the suggestions. I will check out chapter 9 and look at switching to ushort4.

I am using the visual debugger to gather this information, and I know the timing isn’t completely accurate due to optimizations etc. However if I use an arbitrary value vs getting data from the texture memory I experience a kernel execution time difference of ~7ms, or a change of ~1ms to ~8ms. Which seems extremely long considering I am accessing something from texture memory.

const int ix = blockDim.x * blockIdx.x + threadIdx.x;
const int iy = blockDim.y * blockIdx.y + threadIdx.y;
const float x = (float)ix + 0.5f;
const float y = (float)iy + 0.5f;
// int to flip image due to camera vs gl coordinates
int maxImageSize = (imageW * imageH) - 1;

if (ix < imageW && iy < imageH)
{
    ushort result = tex2D(texInImage, x, y);
//    ushort result = 32768;
    // Convert 16 bit image to 8 so that it can be displayed
    unsigned char charResult = (unsigned char)(result >> 8);
    // Concatenate chars together to make a 4 channel pixel
    unsigned int colour =  make_color(charResult, charResult, charResult, charResult);
    // Set pixel
    dst[maxImageSize - (imageW * iy + ix)] = colour;
}

So I guess your claim is that if you write a kernel that reads from texture, and simply flips a 12Mp image, where each pixel is 16 bit (perhaps writing a 32-bit result), the read-from-texture is taking 7ms, vs. 1ms for everything else.

If you want to provide a complete code, that does that, and just that, I’ll take a look. It needs to be something I can copy, paste, compile, and run, without having to add anything or change anything. It shouldn’t require more than about 3x the number of lines of code you have shown so far.

Certainly I agree it should not take 7ms to read a 12Mp image. Depending on the GPU, it should probably be less than 1ms.

Note that using texture probably isn’t buying you much of anything for what you have shown so far. Caches normally provide benefit when there is data re-use. I don’t see any of that here. You appear to be reading each element once. Furthermore, your read patterns could nicely coalesce anyway, without using texture.

you might consider the technique of doing multiple elements per thread - ‘register blocking’ (and unrolling the occurring loops of course). Furthermore, the pattern then should be ‘read - (modify) - write’ where each step as mentioned handles multiple elements. The number of elements per thread should be a template parameter.

We have made good experience with that, leads to clean code and quite good performance.

See paper ‘OMMUNICATION-MINIMIZING 2D CONVOLUTION IN GPU …’ by Iandola et. al
and http://on-demand.gputechconf.com/gtc/2015/presentation/S5152-Hermann-Fuerntratt.pdf

You furthermore should consider also using the ‘cub’ library
http://nvlabs.github.io/cub/

Thanks for all of the suggestions.

Taking all of the suggestions into account as well as switching to the 64-bit (ushort4) read and the 32-bit (uchar8) write got my times down to ~700 us per kernel.

I have read through all of the links that were provided, and though I don’t think they will be extremely helpful for the simple thing I am currently doing, I’m assuming they will be when I start to do more complex processing on the image.

Hi,

I am trying to measure speed up using texture memory. Here(kbaseConvolution) I am performing a convolution operation via naive parallel implementation. This kernel takes 3.45ms time to run 640x360 image with a gaussian_kernel of size 11x11.

To further optimize and benefit from spatial locality provided by texture memory, I implemented a similar convolution using texture memory for gaussian_kernel of 11x11 and 640x360 image. However this kernel(kTextureMemConvolution) runs for 95.58ms. Why am I seeing performance downgrade ? I don’t understand this behavior of texture memory?

global void kbaseConvolution(float * i_gdata, float * i_gkernel, int kernelWidth, int nCols,int nRows, int kRadius, float * o_gdata)
{
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;

    float value = 0.0;
int x0 = 0;
int y0 = 0;
int o_glo c= y * nCols + x;

    for (int j = -kRadius; j <= kRadius;j++) {
         y0 = y + j; 
	for (int i = -kRadius; i <= kRadius; i++) {
		x0 = x + i;
		if (y0 < 0 || x0 < 0 || y0 > nRows || x0 > nCols)
		{
			value += 0.0;
		}
		else {
			value += i_gdata[y0*nCols + x0] * i_gkernel[(j+kRadius)*kernelWidth + (i+kRadius)];
		}
	}
}
o_gdata[o_gloc] = value;

}

////// CONVOLUTION USING TEXTURE MEMORY //////

texture<float, 2, cudaReadModeElementType> g_InputTexture;
texture<float, 2, cudaReadModeElementType> g_GaussianKernel;

global void kTextureMemConvolution(int nRows, int nCols, int kRadius, float *o_gdata)
{
int x = IMAD(blockIdx.x, blockDim.x, threadIdx.x);
int y = IMAD( blockIdx.y, blockDim.y, threadIdx.y);
float value = 0;
int y0 = 0, x0 = 0;

for (int j = -kRadius; j <= kRadius; j++) {
	y0 = y + j;
	for (int i = -kRadius; i <= kRadius; i++) {
		x0 = x + i;
		if (y0 >= 0 || x0 >= 0 || y0 < nRows || x0 < nCols) {
			value += tex2D(g_InputTexture, x0 , y0 ) * tex2D(g_GaussianKernel, (i + kRadius) , (j + kRadius));
		}
	}
}

o_gdata[IMAD(y,nCols, x)] = (float) value;

}

The gaussian filter is separable, so you can do two passes (horizontal + vertical) of 1-D convolutions with a kernel of size 11. The convolution kernel coefficients should be kept in costant memory. It is not useful to compare speed of ‘naive’ implementations.

Thank you for replying !
I understand that a separable convolution performs better, however I was trying to understand performance gain using texture memory. Texture memory provides spatial locality which means that for a convolution operation which exhibits high spatial locality, texture cache should provide better or same performance as that of a convolution performed using L1 cache. Where as it is extremely slow in my case, am I missing some understanding of how texture works or when it should be used?

Performance comparison questions like this usually benefit from a more complete description, including these items:

  1. A complete, short test case. Want help? Don’t make others assemble scaffolding around the code you have shown. Make it easy for them to help you. Provide a short, complete code that someone else can copy, paste, compile, and run, and see the issue, without having to add anything or change anything.

  2. The platform you are running on: Operating system, GPU, CUDA version, driver version.

  3. The compile command line you are using to build the code.

  4. The methodology (if it’s not obvious) by which you arrived at your timing measurement.

I will say that a common error people make is to compile debug projects, or compile with the -G switch to create debuggable code, then attempt to do performance analysis. That is not recommended. Debug code can behave opposite to the way you expect, for performance. Here is a recent example:

https://stackoverflow.com/questions/47637048/why-cuda-shared-memory-is-slower-than-global-memory-in-tiled-matrix-multiplicati

@cpchinmai: It is unusual to put convolution kernels into texture memory, the repeated loading might be a problem for performance. Modify the second kernel so that ‘g_GaussianKernel’ is in constant memory and measure then the runtime. With a 11x11 nonseparable kernel, you might be already compute bound and might be better off using a high-performance implementation ‘GEMM’ routine or the tensor cores.

In my experience, using texture memory (instead of shared memory) in image processing routines works fine and leads to well readable code. I will have a GTC 2018 presentation on that topic (S8111 High-Performance Image Processing Routines for Video and Film Processing). In the GTC-on-demand archive, you will also find also several presentations throughout the years about that topic (optimized memory access in the context of image processing routines).