Are there any tutorials for nppiSSIM_8u_C1R?

seung51hoon · October 31, 2018, 2:25am

Hello, NPP developer.

I’m looking for a tutorial on the nppiSSIM_8u_C1R function.
The CUDA samples don’t include any SSIM example.

Could I get an example for the nppiSSIM_8u_C1R function?
Also, does nppiSSIM_8u_C1R support the RGB24 format?

Thank you.

Robert_Crovella · November 1, 2018, 1:04am

The function computes the Structural Similarity Index (SSIM) between regions of interest in two 8-bit unsigned single-channel images. The calculation should be similar to what is described here:

https://en.wikipedia.org/wiki/Structural_similarity
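For reference, the per-window definition given in that article can be written as follows. The constants shown are the commonly used defaults for 8-bit data; the thread does not confirm that NPP uses exactly these values or the same windowing, so treat this as the textbook form rather than a description of the library internals.

```latex
\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x \mu_y + c_1)\,(2\sigma_{xy} + c_2)}
       {(\mu_x^2 + \mu_y^2 + c_1)\,(\sigma_x^2 + \sigma_y^2 + c_2)},
\qquad c_1 = (k_1 L)^2,\quad c_2 = (k_2 L)^2
% \mu_x, \mu_y: window means; \sigma_x^2, \sigma_y^2: window variances;
% \sigma_{xy}: window covariance; L = 255 for 8-bit images;
% k_1 = 0.01 and k_2 = 0.03 are the usual defaults.
```

Two identical windows give a value of 1, which is consistent with the near-1 result for the lightly blurred image in the example further down.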

RGB24 format would be something like 8u_C3R and there is no variant of the function like that. You might be able to treat a 3-channel interleaved image format like RGB24 as if it were a single channel 8-bit unsigned image (with 3 times the width) for the purpose of the calculation, but I don’t know for certain.
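The flattening idea in the paragraph above can be sketched on the host side as below. This is only a minimal illustration: `FlatRoi`, `flattenInterleavedRoi`, and `stepCoversRow` are hypothetical names (with `FlatRoi` standing in for `NppiSize` so the sketch compiles without the NPP headers), and whether the resulting SSIM value is meaningful for colour data is not established here, since the filter window would mix neighbouring R, G, and B samples.

```cpp
#include <cassert>

// Hypothetical stand-in for NppiSize, so the sketch builds without NPP.
struct FlatRoi {
    int width;   // in 8-bit samples, not pixels
    int height;  // in rows
};

// Describe an interleaved image (e.g. RGB24 with channels = 3) as a
// single-channel 8-bit image: the width is multiplied by the channel
// count and the height is unchanged.
inline FlatRoi flattenInterleavedRoi(int widthPixels, int heightPixels, int channels) {
    assert(widthPixels > 0 && heightPixels > 0 && channels > 0);
    return FlatRoi{widthPixels * channels, heightPixels};
}

// The row step (pitch) in bytes must cover at least one flattened row,
// because each 8u sample occupies one byte.
inline bool stepCoversRow(int stepBytes, const FlatRoi& roi) {
    return stepBytes >= roi.width;
}
```

With this sketch, a 541 x 960 RGB24 image becomes a 1623 x 960 single-channel ROI, and the same flattened ROI would be passed both to the buffer-size query and to the SSIM call.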

Here is a worked example:

$ cat t314.cu
#include <nppi.h>
#include <iostream>
#include <assert.h>
#include <cstdlib>
#include <cstring>   // memcpy

const int idim = 512;
const int sp1 = 10;
const int sp2 = 20;

int main(){

  Npp8u *pSrc1, *pSrc2, *pDeviceBuffer, *hpSrc1, *hpSrc2;
  int   nSrc1Step, nSrc2Step, hBufferSize;
  NppiSize oSizeROI;
  Npp32f *pSSIM, hSSIM;
  NppStatus stat;

  // allocate
  hpSrc1 = new Npp8u[idim*idim];
  hpSrc2 = new Npp8u[idim*idim];
  cudaMalloc(&pSrc1, idim*idim*sizeof(Npp8u));
  cudaMalloc(&pSrc2, idim*idim*sizeof(Npp8u));
  cudaMalloc(&pSSIM, sizeof(Npp32f));
  nSrc1Step = idim*sizeof(Npp8u);
  nSrc2Step = idim*sizeof(Npp8u);
  oSizeROI.width = idim;
  oSizeROI.height = idim;
  // create an image with a square in it
  for (int i = 0; i < idim; i++)
    for (int j = 0; j < idim; j++)
      if ((i>sp1) && (i<sp2) && (j>sp1)&&(j<sp2)) hpSrc1[idim*i+j] = 50;
      else hpSrc1[idim*i+j] = 5;
  // create a horizontal blur for 2nd image
  memcpy(hpSrc2, hpSrc1, idim*idim*sizeof(Npp8u));
  for (int i = 0; i < idim; i++)
    for (int j = 1; j < idim-1; j++)
      hpSrc2[idim*i+j] = (hpSrc1[idim*i+j-1] + hpSrc1[idim*i+j] + hpSrc1[idim*i+j+1])/3;
  // copy images to device
  cudaMemcpy(pSrc1, hpSrc1, idim*idim*sizeof(Npp8u), cudaMemcpyHostToDevice);
  cudaMemcpy(pSrc2, hpSrc2, idim*idim*sizeof(Npp8u), cudaMemcpyHostToDevice);
  // allocate temp space for nppiSSIM
  stat = nppiSSIMGetBufferHostSize_8u_C1R(oSizeROI,&hBufferSize);
  assert(stat == NPP_SUCCESS);
  cudaMalloc(&pDeviceBuffer, hBufferSize);
  // calculate SSIM
  stat = nppiSSIM_8u_C1R(pSrc1, nSrc1Step, pSrc2, nSrc2Step, oSizeROI, pSSIM, pDeviceBuffer);
  assert(stat == NPP_SUCCESS);
  // copy results to host
  cudaMemcpy(&hSSIM, pSSIM, sizeof(Npp32f), cudaMemcpyDeviceToHost);
  // print results
  std::cout << "Structural Similarity: " << hSSIM << std::endl;
  // now repeat with 2 random images
  for (int i = 0; i < idim; i++)
    for (int j = 0; j < idim; j++){
      hpSrc1[idim*i+j] = (Npp8u)((rand()/(float)RAND_MAX) * 255);
      hpSrc2[idim*i+j] = (Npp8u)((rand()/(float)RAND_MAX) * 255);}
  // copy images to device
  cudaMemcpy(pSrc1, hpSrc1, idim*idim*sizeof(Npp8u), cudaMemcpyHostToDevice);
  cudaMemcpy(pSrc2, hpSrc2, idim*idim*sizeof(Npp8u), cudaMemcpyHostToDevice);
  // calculate SSIM
  stat = nppiSSIM_8u_C1R(pSrc1, nSrc1Step, pSrc2, nSrc2Step, oSizeROI, pSSIM, pDeviceBuffer);
  assert(stat == NPP_SUCCESS);
  // copy results to host
  cudaMemcpy(&hSSIM, pSSIM, sizeof(Npp32f), cudaMemcpyDeviceToHost);
  // print results
  std::cout << "Structural Similarity: " << hSSIM << std::endl;
}

$ nvcc -o t314 t314.cu -lnppist
$ cuda-memcheck ./t314
========= CUDA-MEMCHECK
Structural Similarity: 0.999992
Structural Similarity: 0.0258109
========= ERROR SUMMARY: 0 errors
$

seung51hoon · November 1, 2018, 3:22am

Thank you for the example! It will be very helpful.

By the way, I have a question about the relationship between the nppiSSIM_8u_C1R() function and the resolution of the given image.
When I tested nppiSSIM_8u_C1R under the conditions below, I got numerous error messages from the cuda-memcheck tool.

Conditions:

nppiSSIMGetBufferHostSize_8u_C1R({3072 * 3, 3072}, &bufSSIMSize);
cudaMalloc((void **) &bufSSIM, bufSSIMSize);
 
sizeROI = {541 * 3, 960};
nppiSSIM_8u_C1R(src1.data, src1.pitch, src2.data, src2.pitch, oSizeROI, retSSIM, bufSSIM);  // the spot the cuda-memcheck reported.

The message from cuda-memcheck (please refer to the first line for the pointer addresses):

2018-11-02 12:39:39.018 INFO   (PymInspector.c:480) [hello-forward]> src1->data,pitch=[0x7f6242000000, 3072], src2->data,pitch=[0x7f6244000000, 3072], roi=[1623, 960], bufSSIM=[0x7f624ee00000]

========= Invalid __global__ write of size 4
=========     at 0x000019b8 in void TwoPassFilter32f<float, int=1, SSIMTwoPassFunctor<float, int=1>, int=11>(Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, NppiSize, float)
=========     by thread (4,3,0) in block (8,10,0)
=========     Address 0x7f624f3e88fc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24c3ad]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256312]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256507]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x285295]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x239712]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x23d7b3]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 (nppiSSIM_8u_C1R + 0x13) [0x23a263]
=========     Host Frame:/home1/irteam/test/seunghoon.baek/181102/pym/libpym.so (_getSSIM + 0x2b9) [0x289b4]
=========     Host Frame:/home1/irteam/test/seunghoon.baek/181102/pym/libpym.so (pymInspector_getMigrationSSIM + 0xc7) [0x2551e]
=========     Host Frame:/home1/irteam/test/seunghoon.baek/181102/pym/libpym.so (_transcodeJpegToHevc + 0x5f8) [0x1f579]
=========     Host Frame:/home1/irteam/test/seunghoon.baek/181102/pym/libpym.so (_convertForwardPym + 0xa8) [0x1ed35]
=========     Host Frame:/home1/irteam/test/seunghoon.baek/181102/pym/libpym.so (convertForward + 0x301) [0x1d71f]
=========     Host Frame:./pymTest (forwardPymTest + 0x4c9) [0x16b3]
=========     Host Frame:/usr/lib64/libpthread.so.0 [0x7e25]
=========     Host Frame:/usr/lib64/libc.so.6 (clone + 0x6d) [0xf834d]
=========

(I’m sorry for the incomplete information on this issue, but I’m sure the src1 and src2 memory allocations are not the problem.)

If I change the allocation code as below, the cuda-memcheck errors go away.
→ cudaMalloc((void **) &bufSSIM, bufSSIMSize * 2);

Also, if I use a 3072x3072 RGB24 interleaved image instead of the example image above, the errors go away.
(In that case, sizeROI is {3072 * 3, 3072}.)

I wonder why the nppiSSIM_8u_C1R function accesses an illegal memory area in the case above.
Thanks a lot. :)

Robert_Crovella · November 1, 2018, 1:23pm

Perhaps because you are using the wrong buffer function.

Why are you using the PSNR buffer function:

nppiPSNRGetBufferHostSize_8u_C1R
    ^^^^

If you want to find the buffer size for SSIM, you should use the SSIM buffer size function. Refer to the example I already gave you.

seung51hoon · November 2, 2018, 1:14am

Oh, I’m sorry; I copied the wrong code by mistake.

I did test the SSIM function with ‘nppiSSIMGetBufferHostSize_8u_C1R’ and got the same result as in my previous post.

nppiSSIMGetBufferHostSize_8u_C1R({3072 * 3, 3072}, &bufSSIMSize);
cudaMalloc((void **) &bufSSIM, bufSSIMSize);
 
sizeROI = {541 * 3, 960};
nppiSSIM_8u_C1R(src1.data, src1.pitch, src2.data, src2.pitch, oSizeROI, retSSIM, bufSSIM);  // the spot the cuda-memcheck reported.

Sorry for the inconvenience caused by my mistake. Could you take another look at this problem?

Thank you a lot!

Robert_Crovella · November 2, 2018, 3:42am

You are defining sizeROI different from the ROI size you passed to the buffer function, and you are passing oSizeROI to the SSIM calculation function. That is all messed up.

Let’s not go back and forth like this any longer, please.

Please provide a short, complete code, just like the one that I provided you, that demonstrates the problem. I won’t be able to work with snippets, or bits and pieces.

seung51hoon · November 3, 2018, 11:52am

Sorry for asking without having sample code prepared.
Once the sample code is ready, I will ask again.

Thank you for your response.

firejoon · November 28, 2018, 1:21am

Hi Robert, I’m working with seung51hoon now.

I have prepared complete code for this problem:

#include <cstdint>
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>
#include <nppi.h>

int main() {
    int frameWidth = 4096;
    int frameHeight = 4096;

    unsigned int pitch = 4096;

    int imageWidth = 718;
    int imageHeight = 956;

    uint8_t * srcBuffer;
    uint8_t * refBuffer;

    uint8_t * scratchBuffer;
    int sizeScratch;  // int, matching the type passed to the buffer size function in the earlier example

    float * result;
    unsigned int offset;

    NppiSize size = {
            frameWidth,
            frameHeight
    };

    NppiSize lumaROI = {
            imageWidth,
            imageHeight
    };

    NppiSize chromaROI = {
            imageWidth / 2,
            imageHeight / 2
    };

    cuInit(0);

    cudaMalloc(&srcBuffer, frameWidth * frameHeight * 1.5);
    cudaMalloc(&refBuffer, frameWidth * frameHeight * 1.5);

    nppiSSIMGetBufferHostSize_8u_C1R(size, &sizeScratch);
    sizeScratch *= 3;
    cudaMalloc((void **)&scratchBuffer, sizeScratch);
    cudaMalloc((void **)&result, sizeof(Npp32f) * 1);

    printf("src : %p, ref : %p, scratch : %p\n", srcBuffer, refBuffer, scratchBuffer);

    //Y
    printf("Luma\n");
    nppiSSIM_8u_C1R(srcBuffer, pitch, refBuffer, pitch, lumaROI, result, scratchBuffer);

    //U
    printf("U\n");
    offset = pitch * frameHeight;
    nppiSSIM_8u_C1R(srcBuffer + offset, pitch / 2, refBuffer + offset, pitch / 2, chromaROI, result, scratchBuffer);

    //V
    printf("V\n");
    offset += (pitch * frameHeight / 4);
    nppiSSIM_8u_C1R(srcBuffer + offset, pitch / 2, refBuffer + offset, pitch / 2, chromaROI, result, scratchBuffer);

    cudaFree(srcBuffer);
    cudaFree(refBuffer);
    cudaFree(scratchBuffer);
}

When I use

sizeScratch *= 3;

there is no problem:

========= CUDA-MEMCHECK
src : 0x7fb08e000000, ref : 0x7fb080000000, scratch : 0x7fb081800000
Luma
U
V
========= ERROR SUMMARY: 0 errors

But when I reduce the multiplier below 3, for example to 2 or 1, this occurs:

cuda-memcheck ./pymTest sample/test.conf
========= CUDA-MEMCHECK
src : 0x7f00b0000000, ref : 0x7f00ae000000, scratch : 0x7f00af800000
Luma
U
========= Invalid __global__ write of size 4
=========     at 0x000019f0 in void TwoPassFilter32f<float, int=1, SSIMTwoPassFunctor<float, int=1>, int=11>(Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, NppiSize, float)
=========     by thread (31,7,0) in block (7,6,0)
=========     Address 0x7f00afbe02b8 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24c3ad]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256312]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256507]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x285295]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x239712]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x23d7b3]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 (nppiSSIM_8u_C1R + 0x13) [0x23a263]
=========     Host Frame:/home1/irteam/test/js100/pym/libpym.so (test + 0x1fb) [0x1c88f]
=========     Host Frame:./pymTest (main + 0x19) [0x3708]
=========     Host Frame:/usr/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
=========     Host Frame:./pymTest [0x1499]
=========
========= Invalid __global__ write of size 4
=========     at 0x000019f0 in void TwoPassFilter32f<float, int=1, SSIMTwoPassFunctor<float, int=1>, int=11>(Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, NppiSize, float)
=========     by thread (30,7,0) in block (7,6,0)
=========     Address 0x7f00afbe02b4 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24c3ad]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256312]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256507]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x285295]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x239712]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x23d7b3]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 (nppiSSIM_8u_C1R + 0x13) [0x23a263]
=========     Host Frame:/home1/irteam/test/js100/pym/libpym.so (test + 0x1fb) [0x1c88f]
=========     Host Frame:./pymTest (main + 0x19) [0x3708]
=========     Host Frame:/usr/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
=========     Host Frame:./pymTest [0x1499]
=========
========= Invalid __global__ write of size 4
=========     at 0x000019f0 in void TwoPassFilter32f<float, int=1, SSIMTwoPassFunctor<float, int=1>, int=11>(Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, Image<float, int=1>, NppiSize, float)
=========     by thread (29,7,0) in block (7,6,0)
=========     Address 0x7f00afbe02b0 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24c3ad]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256312]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x256507]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x285295]
=========     Host Frame:/usr/local/cuda-9.2/lib64/libnppist.so.9.2 [0x239712]

Could you give us a hint?
We have been spending a lot of time on this problem since we decided to apply SSIM to our application.
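The plane arithmetic used in the code above (luma first, then U and V at half resolution with half the pitch) can be written out as a small host-side helper. This is a sketch under assumptions: the names `PlaneView`, `PlanarLayout`, and `describePlanarYuv420` are hypothetical, and the layout is inferred only from the offsets in the posted code, not from any NPP or decoder documentation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of one plane inside a shared frame buffer.
struct PlaneView {
    std::size_t offsetBytes;  // byte offset from the start of the frame buffer
    int pitchBytes;           // row step in bytes
    int width;                // ROI width in samples
    int height;               // ROI height in rows
};

struct PlanarLayout {
    PlaneView y, u, v;
};

// Mirror the offsets from the posted code: Y at 0, U after the full luma
// plane, V after U, where each chroma plane takes a quarter of the luma bytes.
inline PlanarLayout describePlanarYuv420(int lumaPitch, int frameHeight,
                                         int imageWidth, int imageHeight) {
    assert(lumaPitch > 0 && frameHeight > 0);
    assert(lumaPitch % 2 == 0 && frameHeight % 2 == 0);
    const std::size_t lumaBytes = static_cast<std::size_t>(lumaPitch) * frameHeight;
    const std::size_t chromaBytes = lumaBytes / 4;
    return PlanarLayout{
        PlaneView{0, lumaPitch, imageWidth, imageHeight},
        PlaneView{lumaBytes, lumaPitch / 2, imageWidth / 2, imageHeight / 2},
        PlaneView{lumaBytes + chromaBytes, lumaPitch / 2, imageWidth / 2, imageHeight / 2}};
}
```

Each plane then has its own ROI, so each plane's ROI is what would be passed to the buffer-size query before the corresponding SSIM call.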

Robert_Crovella · November 28, 2018, 1:59am

I’ve already indicated the reason why.

The size parameter you are passing to the get buffer size function:

nppiSSIMGetBufferHostSize_8u_C1R(size, &sizeScratch);
                                 ^^^^

does not match the size parameter you are using for the actual SSIM functions:

nppiSSIM_8u_C1R(srcBuffer + offset, pitch / 2, refBuffer + offset, pitch / 2, chromaROI, result, scratchBuffer);
                                                                              ^^^^^^^^^

Why would you think this is OK or legal? Do you think the size parameter has no relationship to the actual function use? If so, how could requesting a particular size make any sense at all?

They have to match. Add a printf statement to your code that prints out the value of sizeScratch, and make a table of the values you get when you pass size, lumaROI, and chromaROI, to the buffer size function.

firejoon · November 28, 2018, 2:25am

So you mean it is not possible to allocate a fairly large frame buffer and compute the SSIM of a specific area?
We prepare one big frame buffer, let’s say 4K x 4K.
Then we decode the image into the big frame buffer with a pitch of 4K.

Then we compute the SSIM.
The image resolution could be any size smaller than 4K x 4K.

Do we then have to allocate the frame buffer every time we decode an image?

Also, the variable size is larger than chromaROI.

And if the size of the scratch buffer increases along with the frame buffer size,
why does this problem happen?

Robert_Crovella · November 28, 2018, 2:31am

You have to allocate a larger buffer any time the function returns a value that is larger than the one you already allocated. If the function returns a value that is smaller than the one you already allocated, you can reuse your existing buffer.

It doesn’t increase that way. There is no specified relationship between the requested size and the scratch buffer size. You may have assumed that there was a relationship of the type you indicated, but that is not indicated anywhere in the NPP specification, and it is in fact not the case.

I encourage you to perform the experiment I already indicated (print out the buffer sizes for each of the requested frame sizes you have in your example.)
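The reuse rule described above (reallocate only when the reported size exceeds what is already allocated) can be sketched as a small grow-only tracker. To keep the sketch buildable without NPP, the size query is injected as a callback; in real code that callback would presumably wrap `nppiSSIMGetBufferHostSize_8u_C1R`. The names `Roi`, `SizeQuery`, and `ScratchTracker` are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for NppiSize.
struct Roi {
    int width;
    int height;
};

// Returns the scratch size in bytes required for a given ROI.
using SizeQuery = std::function<int(Roi)>;

// Tracks the largest scratch size requested so far.
class ScratchTracker {
public:
    explicit ScratchTracker(SizeQuery query) : query_(std::move(query)) {}

    // Query the size for this exact ROI. Returns true when the caller
    // must free the old scratch buffer and allocate capacity() bytes;
    // returns false when the existing buffer is already large enough.
    bool require(Roi roi) {
        const int needed = query_(roi);
        assert(needed >= 0);
        if (needed > capacity_) {
            capacity_ = needed;
            return true;
        }
        return false;
    }

    int capacity() const { return capacity_; }

private:
    SizeQuery query_;
    int capacity_ = 0;
};
```

The key design point is that `require` is called with the same ROI that is later passed to the SSIM function, so no assumption is made about how the required size varies with image dimensions.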

firejoon · November 28, 2018, 2:35am

Sorry, I misunderstood.
The size of the scratch buffer does not scale with the total number of pixels.
I get it now.
