Does nppiTranspose have an image size limit?

9453148 · August 8, 2024, 3:30am

I found that it does not work on images larger than 32 MiB.

Robert_Crovella · August 8, 2024, 11:24pm

I don’t see any specific limit mentioned in the documentation. The size of the operation is governed by the oSrcROI parameter, which is of type NppiSize and contains two int quantities giving the horizontal and vertical size of the transpose operation. Therefore my assumption would be that you want to choose image dimensions that can be “safely” represented in int type, including the product of the horizontal and vertical dimensions, i.e. as if you were indexing into the image with a single index.

With that proviso, a transpose of dimensions 32,768x32,768 (which comes out to 1 GB of data, for one channel Npp8u image pixel type) seems to work properly for me:

# cat t250.cu
#include <npp.h>
#include <iostream>

const int idim=32768;
const int s = idim*idim;

int main(){

  Npp8u *h, *d, *o;
  h = new Npp8u[s];
  cudaMalloc(&d, sizeof(d[0])*s);
  cudaMalloc(&o, sizeof(o[0])*s);
  for (int i = 0; i < s; i++) h[i] = i%3;
  cudaMemcpy(d, h, s, cudaMemcpyHostToDevice);
  NppiSize ns = {idim, idim};
  NppStatus stat = nppiTranspose_8u_C1R(d, idim, o, idim, ns);
  if (stat != NPP_SUCCESS) std::cout << "Npp error: " << (int)stat << std::endl;
  cudaMemcpy(h, o, s, cudaMemcpyDeviceToHost);
  for (int i = 0; i < idim; i++)
    for (int j = 0; j < idim; j++)
      if (h[j*idim+i] != (i*idim+j)%3) {std::cout << "Mismatch at: " << j*idim+i << " was: " << (int)h[j*idim+i] << " should be: " << (i*idim+j)%3 << std::endl; return 0;}  // cast so the Npp8u value prints as a number, not a character
  std::cout << "Success" << std::endl;
  return 0;
}

# nvcc -o t250 t250.cu -lnppidei
# compute-sanitizer ./t250
========= COMPUTE-SANITIZER
Success
========= ERROR SUMMARY: 0 errors
#
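
The "safely representable in int" proviso above can be checked before calling into NPP. The following is only a minimal sketch of such a check; the helper name fitsInInt is hypothetical and not part of NPP, and passing this check does not guarantee that a given NPP function accepts the size.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical helper (not an NPP function): returns true when
// width * height * bytesPerPixel can be held in an int, i.e. the whole
// image could be addressed with a single int byte offset.
bool fitsInInt(int width, int height, std::size_t bytesPerPixel) {
    assert(bytesPerPixel > 0);
    if (width <= 0 || height <= 0) return false;
    // Multiply in 64 bits so the check itself cannot overflow.
    const long long total = static_cast<long long>(width) * height
                            * static_cast<long long>(bytesPerPixel);
    return total <= INT_MAX;
}
```

For the 32,768x32,768 Npp8u case above, fitsInInt(32768, 32768, 1) returns true, since 2^30 bytes is below INT_MAX. A 65,536x65,536 image of the same type would return false.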

9453148 · August 11, 2024, 2:32am

Hi Robert,
Thanks for your reply. I modified your code slightly for my case. It works fine when ydim is small, such as 1024, but fails for 1024*1024. Did I make any mistakes?
In my app, I use an ROI of (4, 1336440) or even larger.

//# cat t250.cu
#include <npp.h>
#include <iostream>

const int xdim=4;
const int ydim=1024*1024;//2672840*8;
const int s = xdim*ydim*2;

int main(){

  Npp16s *h, *d, *o;
  h = new Npp16s[s];
  printf("size %fMib\n",s/1024.0/1024);
  cudaMalloc(&d, sizeof(d[0])*s);
  cudaMalloc(&o, sizeof(o[0])*s);
  for (int i = 0; i < s; i++) h[i] = i%3+1;
  cudaMemcpy(d, h, s, cudaMemcpyHostToDevice);
  NppiSize ns = {xdim, ydim};
  int xstep=2*xdim;
  int ystep=2*ydim;
  NppStatus stat = nppiTranspose_16s_C1R(d, xstep, o, ystep, ns);
  if (stat != NPP_SUCCESS) std::cout << "Npp error: " << (int)stat << std::endl;
  printf("---\n");
  cudaMemcpy(h, o, s, cudaMemcpyDeviceToHost);
  for (int i = 0; i < xdim; i++)
    for (int j = 0; j < ydim; j++)
    {
        int idx=i*ydim+j;
        int idx2=idx/xdim+idx%xdim*ydim;
      if (h[idx2] != (idx)%3+1)
      {
              printf(" %d(%d),i=%d,j=%d,%d(%d)\n",idx2,h[idx2],i,j,idx,idx%3);
              //std::cout << "Mismatch at: " << j*xdim+i << " was: " << h[j*xdim+i] << " should be: " << (i*xdim+j)%3 << std::endl; return 0;
      }
    }
  std::cout << "Success" << std::endl;
  return 0;
}

9453148 · August 11, 2024, 8:44am

compute-sanitizer reports an error:

========= COMPUTE-SANITIZER
========= Program hit cudaErrorInvalidConfiguration (error 9) due to "invalid configuration argument" on CUDA API call to cudaLaunchKernel.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4808d6]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x290b08]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libnppidei.so.12
=========     Host Frame:nppiTranspose_16s_C1R [0x21d1d9]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libnppidei.so.12
=========     Host Frame:main in /home/rda/XPAR23/mainb/t250.cu:21 [0x3a5f]
=========                in /home/rda/XPAR23/mainb/./t250
=========     Host Frame:__libc_start_main [0x236a3]
=========                in /lib64/libc.so.6
=========     Host Frame:_start [0x388e]
=========                in /home/rda/XPAR23/mainb/./t250
=========

Robert_Crovella · August 12, 2024, 3:32pm

Yes, there appears to be a limit on the y dimension that is some number greater than 1047552 but less than 1048576 (the maximum acceptable value for y appears to be (1024x1024)-16). If this is a problem for you, you could file a bug. However, within the acceptable range of y, it seems that x can be much larger than 16 or 32, so there doesn’t seem to be a particular limit at 32 MiB as indicated in your original posting.

FWIW, I believe x has a larger limit than y, so I believe a transpose of (1572864,128) should be possible, for example.
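
One possible reading of the (1024x1024)-16 figure, offered as an assumption and not confirmed anywhere in this thread: CUDA limits the y dimension of a launch grid to 65535 blocks, and 65535 x 16 = 1048560 = (1024x1024)-16. If the transpose kernel covered 16 image rows per block along grid y (the tile height of 16 is a guess made to fit the observation), the first y value needing 65536 blocks would fail at launch with "invalid configuration argument", which matches the compute-sanitizer report above. It would also fit the observation that x tolerates much larger values, since the grid x limit is far higher. The arithmetic can be checked like this:

```cpp
#include <cassert>

// CUDA's documented maximum for gridDim.y (and gridDim.z).
constexpr long long kMaxGridY = 65535;

// ASSUMPTION: image rows handled per block in the y direction. This value
// is a guess chosen to match the observed limit; NPP's actual kernel
// configuration is not documented.
constexpr long long kAssumedTileRows = 16;

// Number of y blocks needed to cover `rows` image rows (ceiling division).
constexpr long long blocksNeeded(long long rows) {
    return (rows + kAssumedTileRows - 1) / kAssumedTileRows;
}

// True when the hypothetical launch stays within the grid y limit.
constexpr bool launchFits(long long rows) {
    return blocksNeeded(rows) <= kMaxGridY;
}

static_assert(kMaxGridY * kAssumedTileRows == 1024LL * 1024 - 16,
              "65535 * 16 equals (1024*1024)-16");
```

Under this assumption launchFits(1048560) is true, while launchFits(1048576) and launchFits(1336440) are both false.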

9453148 · August 15, 2024, 12:56am

Thank you very much, Robert. In my project I have to transpose a (4, x) matrix where x may be very large. For now I apply several coalesced transposes to avoid this problem.
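
The poster's actual workaround code is not shown, so the following is only a sketch of one way a tall transpose could be split into several smaller ones. It cuts the source into horizontal bands whose height stays at or below the limit reported in this thread. transposeTile is a hypothetical CPU stand-in so that the example runs without a GPU, and transposeInBands and kMaxRoiHeight are likewise illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Largest ROI height reported to work in this thread: (1024*1024)-16.
constexpr int kMaxRoiHeight = 1024 * 1024 - 16;

// Hypothetical CPU stand-in for nppiTranspose_16s_C1R. Steps are in bytes,
// as in NPP. dst(c, r) = src(r, c) for a width x height source ROI.
void transposeTile(const std::int16_t* src, int srcStep,
                   std::int16_t* dst, int dstStep,
                   int width, int height) {
    const std::size_t srcPitch = srcStep / sizeof(std::int16_t);
    const std::size_t dstPitch = dstStep / sizeof(std::int16_t);
    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c)
            dst[c * dstPitch + r] = src[r * srcPitch + c];
}

// Transposes a width x height image in bands of at most maxRows rows.
// Source rows [r0, r0+h) become destination columns [r0, r0+h), so only
// the two base pointers move between calls; both steps stay the same.
void transposeInBands(const std::int16_t* src, int srcStep,
                      std::int16_t* dst, int dstStep,
                      int width, int height,
                      int maxRows = kMaxRoiHeight) {
    assert(maxRows > 0);
    const std::size_t srcPitch = srcStep / sizeof(std::int16_t);
    for (int r0 = 0; r0 < height; r0 += maxRows) {
        const int h = std::min(maxRows, height - r0);
        transposeTile(src + r0 * srcPitch, srcStep,
                      dst + r0, dstStep, width, h);
    }
}
```

On the GPU, the transposeTile call would be replaced by nppiTranspose_16s_C1R with the same pointer offsets applied to the device pointers and an NppiSize of {width, h}. For an ROI of (4, 1336440) this would issue two calls.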

Yuki_Ni · August 16, 2024, 9:39am

This is tracked as internal ticket 4807542. We will post the conclusion here when it is completed.
