Image denoising question about the CUDA sample "imageDenoising"

Hi.

Why does the NLM2 method in the CUDA sample “imageDenoising” run faster than NLM? As far as I understand, the only difference is that NLM2 computes the filter weights and writes them to shared memory, while NLM does not. I can’t see why that makes NLM2 faster, since, as far as I understand, NLM2 and NLM have the same arithmetic intensity.

The documentation says: “Quick NLM has one additional parameter – the block of pixels that share weights. This parameter is crucial in speeding NLM.” So why is this parameter crucial?

In NLM, the color distance between two pixels is derived from the color distance between the 7 x 7 square blocks surrounding those pixels.

So we have this innermost loop:

    //Find color distance from (x, y) to (x + j, y + i)
    for(float n = -NLM_BLOCK_RADIUS; n <= NLM_BLOCK_RADIUS; n++)
        for(float m = -NLM_BLOCK_RADIUS; m <= NLM_BLOCK_RADIUS; m++)
            weightIJ += ...
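To make the cost of that loop concrete, here is a rough host-side C++ sketch of the block comparison. It is not the sample's code: the `Image` struct, its clamped `at()` read (standing in for a texture fetch), and the name `blockDistance` are all hypothetical, and a plain sum of squared differences on a grayscale image is assumed for the elided `weightIJ += ...` term.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical grayscale image; at() clamps to the border and stands in
// for a texture fetch in the real kernel.
struct Image {
    int width;
    int height;
    std::vector<float> data;

    float at(int x, int y) const {
        x = std::max(0, std::min(x, width - 1));
        y = std::max(0, std::min(y, height - 1));
        return data[y * width + x];
    }
};

constexpr int kBlockRadius = 3;  // 7 x 7 comparison block, as in the post

// Sum of squared differences between the block around (x, y) and the
// block around (x + j, y + i). Assumed form of the distance; the post
// elides the actual expression.
float blockDistance(const Image& img, int x, int y, int j, int i) {
    float dist = 0.0f;
    for (int n = -kBlockRadius; n <= kBlockRadius; ++n) {
        for (int m = -kBlockRadius; m <= kBlockRadius; ++m) {
            // Two reads per block element: one from each block.
            float d = img.at(x + j + m, y + i + n) - img.at(x + m, y + n);
            dist += d * d;
        }
    }
    return dist;
}
```

Each call reads 2 x 49 = 98 pixels, and plain NLM repeats this for every window position of every output pixel.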

NLM2 precomputes color weights once per 8x8 block of the output and writes them to shared memory.

The innermost loop (above) turns into:

    //Load precomputed weight
    float weightIJ = fWeights[idx++];

So the number of texture fetches per output pixel is greatly reduced.
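A back-of-the-envelope count shows the size of the reduction. The sketch below is only a counting model, not something taken from the sample: `NlmConfig` and both function names are hypothetical, the search window size is an assumed parameter, and it assumes two fetches per block element for a weight plus one fetch per window position for the color being averaged.

```cpp
#include <cstdint>

// Hypothetical parameters for the counting model.
struct NlmConfig {
    int windowSide;  // search window side in pixels (assumed, not from the post)
    int blockSide;   // comparison block side: 7 in the post
    int tileSide;    // output block that shares weights: 8 in the post
};

// Plain NLM: every output pixel recomputes a block distance for each
// window position (2 fetches per block element), then fetches the color.
std::int64_t fetchesPerPixelNlm(const NlmConfig& c) {
    const std::int64_t window = std::int64_t(c.windowSide) * c.windowSide;
    const std::int64_t block = std::int64_t(c.blockSide) * c.blockSide;
    return window * (2 * block + 1);
}

// Shared weights: the weight computation is done once per output tile,
// so its cost is divided among the tile's pixels. Each pixel still
// fetches one color per window position.
double fetchesPerPixelNlm2(const NlmConfig& c) {
    const std::int64_t window = std::int64_t(c.windowSide) * c.windowSide;
    const std::int64_t block = std::int64_t(c.blockSide) * c.blockSide;
    const std::int64_t tile = std::int64_t(c.tileSide) * c.tileSide;
    return static_cast<double>(window * 2 * block) / static_cast<double>(tile)
           + static_cast<double>(window);
}
```

With an assumed 7 x 7 window, 7 x 7 blocks and 8 x 8 tiles, this model gives 49 * 99 = 4851 fetches per output pixel for NLM versus about 124 for NLM2. The arithmetic per weight is the same, but it is done 64 times less often, which is why the shared block is the parameter that matters.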

As you can see, the results are not identical, because all pixels in a block share one set of weights, but it works. :)