Are there any tutorials for nppiSSIM_8u_C1R?

Hello, NPP developer.

I’m looking for tutorials on the nppiSSIM_8u_C1R function.
The CUDA samples don’t include any SSIM example.

Could I get an example of how to use nppiSSIM_8u_C1R?
Also, does nppiSSIM_8u_C1R support the RGB24 format?

Thank you.

The function computes the Structural Similarity Index (SSIM) between regions of interest in two 8-bit unsigned single-channel images. The calculation should be similar to what is described here:

https://en.wikipedia.org/wiki/Structural_similarity
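For reference, the formula on that page can be evaluated on the CPU in a few lines. The sketch below is not NPP's implementation: it treats the whole image as a single window, whereas SSIM is normally computed over local windows and averaged, so its numbers should not be expected to match nppiSSIM_8u_C1R. The name globalSsim is mine, and the constants are the usual k1 = 0.01, k2 = 0.03 and L = 255.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Single-window SSIM of two equally sized 8-bit grayscale images.
// Assumption: whole-image statistics, not the windowed variant a library would use.
double globalSsim(const std::vector<std::uint8_t>& x,
                  const std::vector<std::uint8_t>& y) {
    assert(!x.empty() && x.size() == y.size());
    const double n = static_cast<double>(x.size());
    const double c1 = (0.01 * 255.0) * (0.01 * 255.0);  // (k1 * L)^2
    const double c2 = (0.03 * 255.0) * (0.03 * 255.0);  // (k2 * L)^2

    // First pass: means of both images.
    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        meanX += x[i];
        meanY += y[i];
    }
    meanX /= n;
    meanY /= n;

    // Second pass: variances and covariance.
    double varX = 0.0, varY = 0.0, cov = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double dx = x[i] - meanX;
        const double dy = y[i] - meanY;
        varX += dx * dx;
        varY += dy * dy;
        cov += dx * dy;
    }
    varX /= n;
    varY /= n;
    cov /= n;

    return ((2.0 * meanX * meanY + c1) * (2.0 * cov + c2)) /
           ((meanX * meanX + meanY * meanY + c1) * (varX + varY + c2));
}
```

Two identical images give 1, and the value drops as the images diverge.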

RGB24 format would be something like 8u_C3R and there is no variant of the function like that. You might be able to treat a 3-channel interleaved image format like RGB24 as if it were a single channel 8-bit unsigned image (with 3 times the width) for the purpose of the calculation, but I don’t know for certain.
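If treating the interleaved data as one wide channel turns out to mix the channels in a way you don't want, another option (which I have not verified against NPP either) is to split the image into three planes and run the 8u_C1R function once per plane. The host-side sketch below only shows the split; RgbPlanes and splitRgb24 are hypothetical names, and in a real pipeline you would probably do this step on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Three tightly packed planes produced from one interleaved RGB24 image.
struct RgbPlanes {
    std::vector<std::uint8_t> r, g, b;
};

// Split an interleaved RGB24 image into planes.
// `pitch` is the source row stride in bytes and may exceed 3 * width.
RgbPlanes splitRgb24(const std::uint8_t* src, std::size_t pitch,
                     std::size_t width, std::size_t height) {
    assert(src != nullptr && pitch >= 3 * width);
    RgbPlanes planes;
    planes.r.resize(width * height);
    planes.g.resize(width * height);
    planes.b.resize(width * height);
    for (std::size_t row = 0; row < height; ++row) {
        const std::uint8_t* line = src + row * pitch;
        for (std::size_t col = 0; col < width; ++col) {
            const std::size_t idx = row * width + col;
            planes.r[idx] = line[3 * col + 0];
            planes.g[idx] = line[3 * col + 1];
            planes.b[idx] = line[3 * col + 2];
        }
    }
    return planes;
}
```

Each plane then has a step equal to the image width in bytes and can be used as an ordinary single-channel image.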

Here is a worked example:

$ cat t314.cu
#include <nppi.h>
#include <iostream>
#include <assert.h>
#include <cstdlib>
#include <cstring>   // memcpy

const int idim = 512;
const int sp1 = 10;
const int sp2 = 20;

int main(){

  Npp8u *pSrc1, *pSrc2, *pDeviceBuffer, *hpSrc1, *hpSrc2;
  int   nSrc1Step, nSrc2Step, hBufferSize;
  NppiSize oSizeROI;
  Npp32f *pSSIM, hSSIM;
  NppStatus stat;

  // allocate
  hpSrc1 = new Npp8u[idim*idim];
  hpSrc2 = new Npp8u[idim*idim];
  cudaMalloc(&pSrc1, idim*idim*sizeof(Npp8u));
  cudaMalloc(&pSrc2, idim*idim*sizeof(Npp8u));
  cudaMalloc(&pSSIM, sizeof(Npp32f));
  nSrc1Step = idim*sizeof(Npp8u);
  nSrc2Step = idim*sizeof(Npp8u);
  oSizeROI.width = idim;
  oSizeROI.height = idim;
  // create an image with a square in it
  for (int i = 0; i < idim; i++)
    for (int j = 0; j < idim; j++)
      if ((i>sp1) && (i<sp2) && (j>sp1)&&(j<sp2)) hpSrc1[idim*i+j] = 50;
      else hpSrc1[idim*i+j] = 5;
  // create a horizontal blur for 2nd image
  memcpy(hpSrc2, hpSrc1, idim*idim*sizeof(Npp8u));
  for (int i = 0; i < idim; i++)
    for (int j = 1; j < idim-1; j++)
      hpSrc2[idim*i+j] = (hpSrc1[idim*i+j-1] + hpSrc1[idim*i+j] + hpSrc1[idim*i+j+1])/3;
  // copy images to device
  cudaMemcpy(pSrc1, hpSrc1, idim*idim*sizeof(Npp8u), cudaMemcpyHostToDevice);
  cudaMemcpy(pSrc2, hpSrc2, idim*idim*sizeof(Npp8u), cudaMemcpyHostToDevice);
  // allocate temp space for nppiSSIM
  stat = nppiSSIMGetBufferHostSize_8u_C1R(oSizeROI,&hBufferSize);
  assert(stat == NPP_SUCCESS);
  cudaMalloc(&pDeviceBuffer, hBufferSize);
  // calculate SSIM
  stat = nppiSSIM_8u_C1R(pSrc1, nSrc1Step, pSrc2, nSrc2Step, oSizeROI, pSSIM, pDeviceBuffer);
  assert(stat == NPP_SUCCESS);
  // copy results to host
  cudaMemcpy(&hSSIM, pSSIM, sizeof(Npp32f), cudaMemcpyDeviceToHost);
  // print results
  std::cout << "Structural Similarity: " << hSSIM << std::endl;
  // now repeat with 2 random images
  for (int i = 0; i < idim; i++)
    for (int j = 0; j < idim; j++){
      hpSrc1[idim*i+j] = (Npp8u)((rand()/(float)RAND_MAX) * 255);
      hpSrc2[idim*i+j] = (Npp8u)((rand()/(float)RAND_MAX) * 255);}
  // copy images to device
  cudaMemcpy(pSrc1, hpSrc1, idim*idim*sizeof(Npp8u), cudaMemcpyHostToDevice);
  cudaMemcpy(pSrc2, hpSrc2, idim*idim*sizeof(Npp8u), cudaMemcpyHostToDevice);
  // calculate SSIM
  stat = nppiSSIM_8u_C1R(pSrc1, nSrc1Step, pSrc2, nSrc2Step, oSizeROI, pSSIM, pDeviceBuffer);
  assert(stat == NPP_SUCCESS);
  // copy results to host
  cudaMemcpy(&hSSIM, pSSIM, sizeof(Npp32f), cudaMemcpyDeviceToHost);
  // print results
  std::cout << "Structural Similarity: " << hSSIM << std::endl;
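  // release device and host memory
  cudaFree(pDeviceBuffer);
  cudaFree(pSSIM);
  cudaFree(pSrc2);
  cudaFree(pSrc1);
  delete[] hpSrc2;
  delete[] hpSrc1;
  return 0;
}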

$ nvcc -o t314 t314.cu -lnppist
$ cuda-memcheck ./t314
========= CUDA-MEMCHECK
Structural Similarity: 0.999992
Structural Similarity: 0.0258109
========= ERROR SUMMARY: 0 errors
$

Thank you for the example! It will be very helpful.

By the way, I have a question about how the nppiSSIM_8u_C1R() function relates to the resolution of the given image.
When I tested nppiSSIM_8u_C1R under the conditions below, cuda-memcheck reported numerous errors.

#################################

Conditions

#################################

nppiSSIMGetBufferHostSize_8u_C1R({3072 * 3, 3072}, &bufSSIMSize);
cudaMalloc((void **) &bufSSIM, bufSSIMSize);
 
sizeROI = {541 * 3, 960};
nppiSSIM_8u_C1R(src1.data, src1.pitch, src2.data, src2.pitch, oSizeROI, retSSIM, bufSSIM);  // the call that cuda-memcheck flagged

#################################

The message from cuda-memcheck

#################################
Please refer to the first line for the pointer addresses.

2018-11-02 12:39:39.018 INFO   (PymInspector.c:480) [hello-forward]> src1->data,pitch=[0x7f6242000000, 3072], src2->data,pitch=[0x7f6244000000, 3072], roi=[1623, 960], bufSSIM=[0x7f624ee00000]

========= Invalid __global__ write of size 4
=========     at 0x000019b8 in void TwoPassFilter32f<float, int=1, SSIMTwoPassFunctor<float, int=1>, int=11>(Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, NppiSize, float)
=========     by thread (4,3,0) in block (8,10,0)
=========     Address 0x7f624f3e88fc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24c3ad]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256312]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256507]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x285295]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x239712]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x23d7b3]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 (nppiSSIM_8u_C1R + 0x13) [0x23a263]
=========     Host Frame:/home1/irteam/test/seunghoon.baek/181102/pym/libpym.so (_getSSIM + 0x2b9) [0x289b4]
=========     Host Frame:/home1/irteam/test/seunghoon.baek/181102/pym/libpym.so (pymInspector_getMigrationSSIM + 0xc7) [0x2551e]
=========     Host Frame:/home1/irteam/test/seunghoon.baek/181102/pym/libpym.so (_transcodeJpegToHevc + 0x5f8) [0x1f579]
=========     Host Frame:/home1/irteam/test/seunghoon.baek/181102/pym/libpym.so (_convertForwardPym + 0xa8) [0x1ed35]
=========     Host Frame:/home1/irteam/test/seunghoon.baek/181102/pym/libpym.so (convertForward + 0x301) [0x1d71f]
=========     Host Frame:./pymTest (forwardPymTest + 0x4c9) [0x16b3]
=========     Host Frame:/usr/lib64/libpthread.so.0 [0x7e25]
=========     Host Frame:/usr/lib64/libc.so.6 (clone + 0x6d) [0xf834d]
=========

(I’m sorry that I can’t provide more information about this issue, but I’m sure there is nothing wrong with the src1 and src2 memory.)

If I change the allocation as shown below, the cuda-memcheck errors go away.
→ cudaMalloc((void **) &bufSSIM, bufSSIMSize * 2);

Also, if I use a 3072x3072 RGB24 interleaved image instead of the example image above, the errors go away.
(In that case, sizeROI is {3072 * 3, 3072}.)

I wonder why the nppiSSIM_8u_C1R function accesses an illegal memory area in the case above.
Thanks a lot. :)

Perhaps because you are using the wrong buffer function.

Why are you using the PSNR buffer function:

nppiPSNRGetBufferHostSize_8u_C1R
    ^^^^

If you want to find the buffer size for SSIM, you should use the SSIM buffer size function. Refer to the example I already gave you.

Oh, I’m sorry, that was my mistake: I pasted the wrong code.

I did test the SSIM function with ‘nppiSSIMGetBufferHostSize_8u_C1R’ and got the same result as in my previous post.

nppiSSIMGetBufferHostSize_8u_C1R({3072 * 3, 3072}, &bufSSIMSize);
cudaMalloc((void **) &bufSSIM, bufSSIMSize);
 
sizeROI = {541 * 3, 960};
nppiSSIM_8u_C1R(src1.data, src1.pitch, src2.data, src2.pitch, oSizeROI, retSSIM, bufSSIM);  // the call that cuda-memcheck flagged

Sorry for the inconvenience caused by my mistake. Could you take another look at this problem?

Thank you a lot!

You are defining sizeROI different from the ROI size you passed to the buffer function, and you are passing oSizeROI to the SSIM calculation function. That is all messed up.

Let’s not go back and forth like this any longer, please.

Please provide a short, complete code, just like the one that I provided you, that demonstrates the problem. I won’t be able to work with snippets, or bits and pieces.

Sorry for asking without having a complete sample ready.
Once the sample code is ready, I will ask again.

Thank you for your response.

Hi Robert, I’m working with seung51hoon now.

I have prepared complete code for this problem:

int main() {
    int frameWidth = 4096;
    int frameHeight = 4096;

    unsigned int pitch = 4096;

    int imageWidth = 718;
    int imageHeight = 956;

    uint8_t * srcBuffer;
    uint8_t * refBuffer;

    uint8_t * scratchBuffer;
    unsigned int sizeScratch;

    float * result;
    unsigned int offset;

    NppiSize size = {
            frameWidth,
            frameHeight
    };

    NppiSize lumaROI = {
            imageWidth,
            imageHeight
    };

    NppiSize chromaROI = {
            imageWidth / 2,
            imageHeight / 2
    };

    cuInit(0);

    cudaMalloc(&srcBuffer, frameWidth * frameHeight * 1.5);
    cudaMalloc(&refBuffer, frameWidth * frameHeight * 1.5);

    nppiSSIMGetBufferHostSize_8u_C1R(size, &sizeScratch);
    sizeScratch *= 3;
    cudaMalloc((void **)&scratchBuffer, sizeScratch);
    cudaMalloc((void **)&result, sizeof(Npp32f) * 1);

    printf("src : %p, ref : %p, scratch : %p\n", srcBuffer, refBuffer, scratchBuffer);

    //Y
    printf("Luma\n");
    nppiSSIM_8u_C1R(srcBuffer, pitch, refBuffer, pitch, lumaROI, result, scratchBuffer);

    //U
    printf("U\n");
    offset = pitch * frameHeight;
    nppiSSIM_8u_C1R(srcBuffer + offset, pitch / 2, refBuffer + offset, pitch / 2, chromaROI, result, scratchBuffer);

    //V
    printf("V\n");
    offset += (pitch * frameHeight / 4);
    nppiSSIM_8u_C1R(srcBuffer + offset, pitch / 2, refBuffer + offset, pitch / 2, chromaROI, result, scratchBuffer);

    cudaFree(srcBuffer);
    cudaFree(refBuffer);
    cudaFree(scratchBuffer);
}

When I use

sizeScratch *= 3;

there’s no problem:

========= CUDA-MEMCHECK
src : 0x7fb08e000000, ref : 0x7fb080000000, scratch : 0x7fb081800000
Luma
U
V
========= ERROR SUMMARY: 0 errors

But when I reduce the multiplier below 3, for example to 2 or 1, this occurs:

cuda-memcheck ./pymTest sample/test.conf
========= CUDA-MEMCHECK
src : 0x7f00b0000000, ref : 0x7f00ae000000, scratch : 0x7f00af800000
Luma
U
========= Invalid __global__ write of size 4
=========     at 0x000019f0 in void TwoPassFilter32f<float, int=1, SSIMTwoPassFunctor<float, int=1>, int=11>(Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, NppiSize, float)
=========     by thread (31,7,0) in block (7,6,0)
=========     Address 0x7f00afbe02b8 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24c3ad]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256312]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256507]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x285295]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x239712]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x23d7b3]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 (nppiSSIM_8u_C1R + 0x13) [0x23a263]
=========     Host Frame:/home1/irteam/test/js100/pym/libpym.so (test + 0x1fb) [0x1c88f]
=========     Host Frame:./pymTest (main + 0x19) [0x3708]
=========     Host Frame:/usr/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
=========     Host Frame:./pymTest [0x1499]
=========
========= Invalid __global__ write of size 4
=========     at 0x000019f0 in void TwoPassFilter32f<float, int=1, SSIMTwoPassFunctor<float, int=1>, int=11>(Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, NppiSize, float)
=========     by thread (30,7,0) in block (7,6,0)
=========     Address 0x7f00afbe02b4 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24c3ad]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256312]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256507]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x285295]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x239712]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x23d7b3]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 (nppiSSIM_8u_C1R + 0x13) [0x23a263]
=========     Host Frame:/home1/irteam/test/js100/pym/libpym.so (test + 0x1fb) [0x1c88f]
=========     Host Frame:./pymTest (main + 0x19) [0x3708]
=========     Host Frame:/usr/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
=========     Host Frame:./pymTest [0x1499]
=========
========= Invalid __global__ write of size 4
=========     at 0x000019f0 in void TwoPassFilter32f<float, int=1, SSIMTwoPassFunctor<float, int=1>, int=11>(Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, NppiSize, float)
=========     by thread (29,7,0) in block (7,6,0)
=========     Address 0x7f00afbe02b0 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24c3ad]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256312]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256507]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x285295]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x239712]

Could you give us a hint?
We have spent a lot of time on this problem since we decided to use SSIM in our application.

I’ve already indicated the reason why.

The size parameter you are passing to the get buffer size function:

nppiSSIMGetBufferHostSize_8u_C1R(size, &sizeScratch);
                                 ^^^^

does not match the size parameter you are using for the actual SSIM functions:

nppiSSIM_8u_C1R(srcBuffer + offset, pitch / 2, refBuffer + offset, pitch / 2, chromaROI, result, scratchBuffer);
                                                                              ^^^^^^^^^

Why would you think this is OK or legal? Do you think the size parameter has no relationship to the actual function use? If so, how could requesting a particular size make any sense at all?

They have to match. Add a printf statement to your code that prints out the value of sizeScratch, and make a table of the values you get when you pass size, lumaROI, and chromaROI, to the buffer size function.
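One way to run that experiment is sketched below. RoiSize and ScratchQuery are stand-ins I made up so the helper compiles without the NPP headers; in real code the query would call nppiSSIMGetBufferHostSize_8u_C1R for the given ROI, check the returned status, and hand back the size it reports.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <functional>
#include <vector>

// Stand-in for NppiSize (hypothetical, only to keep the sketch self-contained).
struct RoiSize {
    int width;
    int height;
};

// Callable returning the scratch size in bytes for one ROI.
// Assumption: in real code this wraps nppiSSIMGetBufferHostSize_8u_C1R.
using ScratchQuery = std::function<int(RoiSize)>;

// Print one table row per ROI and return the largest size reported.
int largestScratchBytes(const std::vector<RoiSize>& rois,
                        const ScratchQuery& query) {
    int largest = 0;
    for (const RoiSize& roi : rois) {
        const int bytes = query(roi);
        assert(bytes >= 0);
        std::printf("%5d x %5d -> %d bytes\n", roi.width, roi.height, bytes);
        largest = std::max(largest, bytes);
    }
    return largest;
}
```

Passing size, lumaROI and chromaROI through a helper like this prints the table, and the returned maximum is the single allocation size that covers all three calls according to the buffer size function.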

So you mean it is not possible to allocate one large frame buffer and compute the SSIM of a specific area of it?
We prepare one big frame buffer, let’s say 4K x 4K.
Then we decode the image into that frame buffer with a pitch of 4K.

Then we compute the SSIM.
The image resolution can be any size smaller than 4K x 4K.

So do we have to allocate a frame buffer every time we decode an image?

Also, the variable size is larger than chromaROI.

And if the scratch buffer size increases along with the frame buffer size,
why does this problem happen?

You have to allocate a larger scratch buffer any time the buffer size function returns a value that is larger than the one you already allocated. If it returns a value that is smaller than the one you already allocated, you can reuse your existing buffer.

No, the scratch buffer size does not increase along with the frame buffer size. There is no specified relationship between the requested size and the scratch buffer size. You may have assumed that there was a relationship of the type you indicated, but that is not indicated anywhere in the NPP specification, and it is in fact not the case.

I encourage you to perform the experiment I already indicated (print out the buffer sizes for each of the requested frame sizes you have in your example.)
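The reuse rule above (allocate again only when the reported size exceeds what you already hold) could look roughly like this. ScratchBuffer is a hypothetical helper, and std::malloc/std::free stand in for cudaMalloc/cudaFree so that the sketch runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Grow-only scratch buffer.
// Assumption: host allocation stands in for cudaMalloc/cudaFree.
class ScratchBuffer {
public:
    ScratchBuffer() = default;
    ScratchBuffer(const ScratchBuffer&) = delete;
    ScratchBuffer& operator=(const ScratchBuffer&) = delete;
    ~ScratchBuffer() { std::free(data_); }

    // Return a buffer of at least `bytes` bytes. A new allocation happens
    // only when the current one is too small; otherwise the buffer is reused.
    void* ensure(std::size_t bytes) {
        assert(bytes > 0);
        if (bytes > capacity_) {
            void* fresh = std::malloc(bytes);
            if (fresh == nullptr) {
                return nullptr;  // allocation failed, old buffer left untouched
            }
            std::free(data_);
            data_ = fresh;
            capacity_ = bytes;
            ++allocations_;
        }
        return data_;
    }

    std::size_t capacity() const { return capacity_; }
    int allocations() const { return allocations_; }

private:
    void* data_ = nullptr;
    std::size_t capacity_ = 0;
    int allocations_ = 0;
};
```

Before each SSIM call you would query the buffer size for the ROI you are about to use and pass that value to ensure().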

Sorry, I misunderstood.
The scratch buffer size does not scale with the total number of pixels.
I got it.