Accelerated Filtering

Hi, I have some questions about NVIDIA GPUs:

Is bilinear filtering via OpenCL’s read_image{f,h} really any faster than loading four pixels and interpolating them in the shader code yourself?

Which is faster: OpenCL’s read_imageh or CUDA’s tex2D?

What are the fastest hardware-accelerated filtering options available on GeForce GPUs? Apart from nearest neighbor, are there any other fast filtering methods?

What are the fastest formats to filter?
I was planning to use RGBA FP16 but am open to using faster ones.

Bilinear filtering in the texture units is a bit less accurate than doing it yourself (the hardware uses low-precision fixed-point weights).
You can write a fast kernel that does the filtering computations manually.
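
As a rough sketch (OpenCL, with made-up names; the source position here is simply the point between each 2x2 block of texels, adapt it to your own mapping), the manual variant reads the four texels with a nearest-neighbor sampler and blends them with mix():

kernel void bilinear_manual(read_only image2d_t src, write_only image2d_t dst)
{
    const sampler_t nn = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 out_xy = (int2)(get_global_id(0), get_global_id(1));

    // Source position in unnormalized coordinates (here: between each 2x2 block of texels).
    float2 pos = convert_float2(out_xy) + (float2)(1.0f, 1.0f);

    // Texel centers sit at integer + 0.5, so the top-left texel of the 2x2 window is:
    int2   p0 = convert_int2(floor(pos - 0.5f));
    float2 f  = (pos - 0.5f) - convert_float2(p0);   // fractional weights in x and y

    float4 c00 = read_imagef(src, nn, p0);
    float4 c10 = read_imagef(src, nn, p0 + (int2)(1, 0));
    float4 c01 = read_imagef(src, nn, p0 + (int2)(0, 1));
    float4 c11 = read_imagef(src, nn, p0 + (int2)(1, 1));

    // Two horizontal lerps and one vertical lerp, all in FP32.
    float4 top    = mix(c00, c10, f.x);
    float4 bottom = mix(c01, c11, f.x);
    write_imagef(dst, out_xy, mix(top, bottom, f.y));
}

The hardware path would be the same kernel with a CLK_FILTER_LINEAR sampler and a single read_imagef at pos.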

Some more advanced details:

The question is whether you need your compute power for something else. Most probably not, as current GPUs have so much computational power and bilinear filtering is rather simple.

You should avoid divisions for normalization, e.g. use multiplication by a precomputed reciprocal. If that is not possible, do the calculations in fixed-point arithmetic.
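
A plain-C sketch of both variants (names are made up; the fixed-point version blends 8-bit values with a weight scaled to 0..256, similar in spirit to what the filtering hardware does):

#include <math.h>
#include <stdint.h>

/* Linear blend with a precomputed reciprocal: no division in the per-pixel path.
   inv_targetwidth = 1.0f / targetwidth, computed once outside the loop/kernel. */
static float lerp_recip(float p0, float p1, int targetx, int srcwidth, float inv_targetwidth)
{
    float srcx = (float)targetx * (float)srcwidth * inv_targetwidth;
    float frac = srcx - floorf(srcx);
    return (1.0f - frac) * p0 + frac * p1;
}

/* The same blend in fixed point: weight_q8 is the fraction scaled to 0..256. */
static uint8_t lerp_fixed(uint8_t p0, uint8_t p1, int weight_q8)
{
    return (uint8_t)((p0 * (256 - weight_q8) + p1 * weight_q8 + 128) >> 8);
}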

The texture units use a different data path from L2, and fewer values have to be sent. A few programs can profit from that. You could also use __ldg() for the other data path, or the PTX instruction tld4 to load all four values at once.

To your other questions:
Trilinear filtering (i.e. linear interpolation in 3D) is probably the most advanced mode available in the texture units. You can use the tensor cores for convolutions with quite complex filters; that could also be called hardware-accelerated in some sense.

Accessing textures should be similarly fast in OpenCL and CUDA, I would guess, but I do not know the details.

What are the fastest formats? FP16 can be processed natively by only some of the units, including the texture units and the tensor cores; otherwise you may need conversions. On the other hand, it has half the memory footprint of FP32.
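
For illustration, a minimal OpenCL sketch of that conversion route (hypothetical kernel; with cl_khr_fp16 you can keep half4 values in registers, but doing the arithmetic in FP32 means converting back and forth):

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

kernel void scale_fp16(read_only image2d_t src, write_only image2d_t dst, float gain)
{
    const sampler_t s = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE | CLK_FILTER_NEAREST;
    int2 xy = (int2)(get_global_id(0), get_global_id(1));

    half4  h = read_imageh(src, s, xy);       // FP16 data loaded through the texture path
    float4 f = convert_float4(h) * gain;      // arithmetic in FP32 needs a conversion
    write_imageh(dst, xy, convert_half4(f));  // and a conversion back for FP16 storage
}

(If you use read_imagef instead, the texture path delivers FP32 values directly.)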

Without some more background (how many RGBA pixels you intend to interpolate per second, and whether they have random coordinates or you are resampling an image in a systematic manner), it is difficult to help you. I would guess device memory bandwidth will be your limiting factor. Say your GPU can do 500 GB/s. With RGBA FP16 you have 8 bytes per pixel. Assume that of the 4 needed source pixels, 1 has to be read from memory and 3 are already in the cache, and that 1 output pixel has to be written; that is 16 bytes of traffic per output pixel, so you can process 31.25 billion pixels per second. In that example it would not matter whether you do bilinear, bicubic or Lanczos filtering.

You have to say whether that is enough or whether you have to squeeze out more performance.

Thank you for your answer.

Correct me if I am wrong, but wouldn't the GPU's peak floating-point ops per cycle outshine the texture unit's bilinear samples per cycle multiplied by 8? I multiplied by 8 because I would consider 1 bilinear sample to be effectively 8 flops (4 muls, 4 adds). Would it be faster to batch the computations into some vectorized workload?

If this is the case, why have a texture unit at all? Why not remove them in the next GeForce and use the chip area for something else, like more local memory or more FP units?

Accuracy is not too important to me. The picture just has to look bilinearly interpolated. Assume that I am performing a standard interpolation all across the image. You are right, this should be memory bound, but say at each 2x2 window I had to perform x interpolations (imagine I don't have to write these interpolations out, for some reason, as that would increase the memory traffic). Would there be a certain x where using the texture unit beats just performing manual interpolations with the optimizations you mentioned?

Sorry for the long response, one last thing:

Below is an excerpt from this site by NVIDIA about the Tegra GPU. Their use of the word “free” when talking about bilinear filtering from within a single mip level leads me to think that it is quite fast. Also, what is meant by “always generate mipmap chains”? Aren't mip maps just the same picture, only downscaled? How would storing extra images help performance?

"""
Texture Filtering
The Tegra fragment unit incorporates sophisticated hardware texture sampling features such as bilinear, trilinear and anisotropic filter modes. In some cases, using these features has a performance cost, and application developers should take care to understand the impact of their use.

Always generate mipmap chains, and enable an appropriate mipmap-selecting filter mode, for texture assets. Use of texture samplers without mipmapping tends to result in unpredictable memory access patterns, which defeat the texture cache, reduces memory efficiency, and increases memory bandwidth consumption, with a subsequent reduction in performance.
“Bilinear” minification filter modes (i.e., those that sample from within a single mip level only: GL_NEAREST_MIPMAP_NEAREST and GL_LINEAR_MIPMAP_NEAREST) are “free” and can be executed at 1 sample per clock, per fragment unit.
"""

It has been many years since I last looked at this, but from vague memory, is it not something like 8 FMAs, 1 reciprocal, and 3 further (+,*) operations?

Texture interpolation is heavily used in classical 3D graphics. It stands to reason that the hardware texture unit, which performs the interpolation in low-precision 1.8 fixed-point arithmetic (9 bits, 8 of them fractional), requires less power than performing the equivalent series of discrete operations at full precision. Given that GPU power draw already is an issue, retaining hardware-based filtering for 3D graphics therefore seems an advantageous choice for now.

Compute applications often require finer interpolation granularity than what the hardware filtering provides. I seem to recall high-resolution computed tomography as one such use case from discussions in this forum. In that context, using the discrete FP32 equivalent for accuracy / quality reasons already made sense around 2015 or so, with a small-ish impact on performance at that time; I think the “performance penalty” might have been something like 20%, but my memory is very hazy.

I think it is possible that for CUDA applications any performance advantage from using bilinear hardware interpolation has practically disappeared with the latest GPU architectures, but I have not experimentally confirmed that this is the case. If you decide to perform such experiments, I would be interested in reading about the results.


I’ve heard that some GPUs have special pipelines for texture memory.

What if you performed floating-point operations on the interpolated texels? Say I wanted to do another, more sophisticated interpolation step after the bilinear interpolation. How would the FP unit pipeline interact with the TMU pipeline?

Also, do you know if OpenCL's write_image{f,h} goes through the TMU/render unit?
I'm not sure if this is the case; I would guess it would be bad for performance, since the fill rate would then be an upper bound on performance.

My suggestion would be to prototype some of the design alternatives you are contemplating and profile them on the intended hardware target (or a substantially similar one). This will provide you with a lower bound on the expected performance and some good insight into what the bottlenecks are going to be.

You are correct that the trajectory of GPU development over the past decade has provided faster growth in compute throughput than in memory throughput, leading to an increase in the number of applications whose performance is bound by memory bandwidth. See also: roofline analysis.

Regarding the 1 reciprocal @njuffa mentioned:

You need the reciprocal of the target image size when resampling. The coordinates are:
srcx = targetx * srcwidth / targetwidth

The coefficients for the linear sampling can be taken directly from srcx and its fractional part:
value = (1.f - fraction) * src[floor(srcx)] + fraction * src[floor(srcx) + 1]

Depending on your application, this reciprocal is a constant, can be computed once per kernel call (e.g. on the CPU, or once in each CUDA block), or has to be computed with each access.

Bilinear interpolation just multiplies the x and y coefficients and combines the values of all 4 pixels.
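
Written out for one output sample (plain C, made-up names; src is a single-channel, row-major image of width srcwidth, and srcx/srcy were computed as above):

#include <math.h>

/* Bilinear sample at (srcx, srcy): the four weights are products of the x and y coefficients. */
static float bilinear(const float *src, int srcwidth, float srcx, float srcy)
{
    int   x0 = (int)floorf(srcx);
    int   y0 = (int)floorf(srcy);
    float fx = srcx - (float)x0;    /* fractional parts = interpolation coefficients */
    float fy = srcy - (float)y0;

    const float *row0 = src + y0 * srcwidth;
    const float *row1 = row0 + srcwidth;

    float top    = (1.0f - fx) * row0[x0] + fx * row0[x0 + 1];
    float bottom = (1.0f - fx) * row1[x0] + fx * row1[x0 + 1];
    return (1.0f - fy) * top + fy * bottom;
}

That is roughly the handful of multiply-adds plus one reciprocal (hidden in computing srcx and srcy) that was mentioned above.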

The computational speed of current GPUs is more than enough (if they do not need the compute power for something else concurrently). Except in special circumstances you will not gain any speed-up. If it is important for your use case, by all means try it out.

About the TMU memory pipeline.

This presentation, “RTX ON – The NVIDIA Turing GPU” (2019 IEEE Hot Chips 31 Symposium, DOI 10.1109/HOTCHIPS.2019.8875651), from the Hot Chips conference gives you a hint (slide 7) for the Pascal and Turing generations.

In earlier architectures, texture and global data took the same path; now shared memory and global data take the same path, and the texture unit is separate.

More or less the texture units kept their speed and functionality, and everything else was improved.

So how would I even find out whether manual linear interpolation is better than using the interpolation hardware in terms of speed?

If I write a kernel that just does the one interpolation, it's going to be memory bound and the times are going to look the same.

Sure, I could adjust for the speedup due to the cache by having the manual kernel read via nearest neighbor, right?

But the first issue is still a problem for figuring out whether the HW filtering is fast enough.

By prototyping it in the context of your particular use case.

I’m not writing some application at the moment of that sort.

I am trying to find out which is faster on my hardware. But the issue I am having is that a kernel that just does 1 interpolation is memory bound, so the time difference is not clear.

I don't think running the interpolation multiple times to amortize the memory access time is fair to the HW interpolation variant, because that means multiple arbitrary loads that would not happen outside of my profiling setup. Or maybe that's not a big deal because it's in the L1 texture cache?

Set up an experiment. Measure the execution time of your kernel variants and profile them. Refer back to the documentation and build a mental model around your observations. I promise you will learn a lot with this approach. Generally speaking, HPC software engineering is not a discipline that relies much on thought experiments.

Okay, so here’s the bilinear interpolation kernel:

Each work-item produces WI_SIZE_X * WI_SIZE_Y output elements. Work-group dimensions are WG_SIZE_X x WG_SIZE_Y.

The input image is H x W, the output is (H - 1) x (W - 1). Is this how bilinear interpolation is typically used?

The good news is that I achieve peak memory bandwidth for CL_UNORM_INT8 when I set WI_SIZE_X=1 and WI_SIZE_Y=3.
Why is that the only setting that gets peak memory bandwidth?
My guess is memory coalescing + L1 cache. Each work-item computes a 3x1 column of the output, which requires a 4x2 input patch. Those 4x2 input patches are right next to each other, so I think there would be roughly one cache miss, and once all the data is loaded into L1, the HW would perform the interpolations at whatever throughput the GPU gets (texel fill rate). I also think a 3x1 output column is easier to coalesce than a 1x3 output row.

This is just my intuition. Please let me know if I am not correct about something.

My questions:
Each interpolation must involve fetching a 2x2 window into the L1 cache, correct?

kernel code

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

#define REP(i, n) for(int i = 0;i < (n);++i)

kernel void interpolate2(read_only image2d_t input, write_only image2d_t output) {
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_NONE | CLK_FILTER_LINEAR;

    // Top-left output coordinate of this work-item's WI_SIZE_X x WI_SIZE_Y tile.
    int2 ids = (int2)(WI_SIZE_X * (int)get_global_id(0), WI_SIZE_Y * (int)get_global_id(1));

    // Sample at integer (unnormalized) coordinates, i.e. exactly between four texel
    // centers, so the linear filter blends a 2x2 input window with equal weights.
    float2 coords = convert_float2(ids) + (float2)(1.0f, 1.0f);

    #pragma unroll WI_SIZE_X
    REP(i, WI_SIZE_X) {
        #pragma unroll WI_SIZE_Y
        REP(j, WI_SIZE_Y) {
            half4 pixel = read_imageh(input, sampler, coords + (float2)((float)(i), (float)(j)));
            write_imageh(output, ids + (int2)(i, j), pixel);
        }
    }
}

host code (not complete)

#define W (112 * 8)
#define H (112)

#define WG_SIZE_X 8
#define WG_SIZE_Y 8

#define WI_SIZE_X 1
#define WI_SIZE_Y 3

typedef struct _IMG {
    cl_image_format image_format;
    cl_image_desc image_desc;
    cl_mem mem;
} IMG;

...

    IMG A = {{CL_RGBA, CL_UNORM_INT8}, {CL_MEM_OBJECT_IMAGE2D, W, H, 1}};
    A.mem = clCreateImage(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, &A.image_format, &A.image_desc, NULL, &ret);
    mem_objects.push_back(A.mem);
    ASSERT(ret == CL_SUCCESS);

    IMG A_interp = {{CL_RGBA, CL_UNORM_INT8}, {CL_MEM_OBJECT_IMAGE2D, W - 1, H - 1, 1}};
    A_interp.mem = clCreateImage(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, &A_interp.image_format, &A_interp.image_desc, NULL, &ret);
    mem_objects.push_back(A_interp.mem);
    ASSERT(ret == CL_SUCCESS);

    cl_kernel interpolate_kernel = clCreateKernel(program, "interpolate2", &ret);
    ASSERT(ret == CL_SUCCESS);
    ret = clSetKernelArg(interpolate_kernel, 0, sizeof(cl_mem), (void *)&A.mem);
    ret = clSetKernelArg(interpolate_kernel, 1, sizeof(cl_mem), (void *)&A_interp.mem);

    globalws = {(A_interp.image_desc.image_width) / WI_SIZE_X, (A_interp.image_desc.image_height) / WI_SIZE_Y};
    localws = {WG_SIZE_X, WG_SIZE_Y};

    ret = clEnqueueNDRangeKernel(queue, interpolate_kernel, 2, NULL, globalws.data(), localws.data(), 0, NULL, my_event);

Do you know if the submit-to-start time should be factored into memory bandwidth calculations, or just start-to-end? I'm using OpenCL just for testing, as I'm working with lots of GPUs, but will use CUDA later for the NVIDIA ones.

OpenCL execution cycle:
queued → submit → start → end → complete
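
For reference, here is roughly how I pull those timestamps (sketch; it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and that my_event is the cl_event pointer passed to clEnqueueNDRangeKernel above):

    clWaitForEvents(1, my_event);

    cl_ulong t_queued, t_submit, t_start, t_end;
    clGetEventProfilingInfo(*my_event, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &t_queued, NULL);
    clGetEventProfilingInfo(*my_event, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &t_submit, NULL);
    clGetEventProfilingInfo(*my_event, CL_PROFILING_COMMAND_START,  sizeof(cl_ulong), &t_start,  NULL);
    clGetEventProfilingInfo(*my_event, CL_PROFILING_COMMAND_END,    sizeof(cl_ulong), &t_end,    NULL);

    // Timestamps are in nanoseconds; I use start-to-end for the effective bandwidth number.
    double seconds = (double)(t_end - t_start) * 1e-9;
    double bytes   = 4.0 * ((double)W * H + (double)(W - 1) * (H - 1)); // RGBA8: 4 B/pixel, read + write
    printf("%.1f GB/s (start to end)\n", bytes / seconds / 1e9);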

I’ve been measuring just start to end for my effective bandwidth calculations, and the results seem plausible.

But I did an FP16 bilinear interpolation test that showed memory bandwidth beyond the theoretical peak, so now I am not so sure.