Does nppiTranspose have an image size limit?

I found that it does not work on images larger than 32 MiB.

I don’t see any specific limit mentioned in the documentation. The size of the operation is governed by the oSrcROI parameter, which is an NppiSize type containing two int quantities that give the horizontal and vertical size of the transpose operation. Therefore my assumption would be that you want to choose image dimensions that can be “safely” represented in int type, including the product of the horizontal and vertical dimensions, i.e. as if you were indexing into the image with a single index.
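As a rough illustration of that assumption (not a documented NPP requirement), the check could be sketched as below. The helper name roiFitsInInt is hypothetical and not part of NPP; it only tests that both dimensions are positive and that their product still fits in an int.

```cpp
#include <climits>
#include <cstdint>

// Hypothetical helper, not an NPP function: returns true when both ROI
// dimensions are positive and width*height can be used as a single int index.
bool roiFitsInInt(int width, int height) {
  if (width <= 0 || height <= 0) return false;
  // Do the multiplication in 64 bits so the product itself cannot overflow.
  const std::int64_t pixels = static_cast<std::int64_t>(width) * height;
  return pixels <= INT_MAX;
}
```

Under this check a 32,768x32,768 ROI passes (2^30 pixels), while 65,536x32,768 would not (2^31 pixels).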

With that proviso, a transpose of dimensions 32,768x32,768 (which comes out to 1 GB of data, for one channel Npp8u image pixel type) seems to work properly for me:

# cat t250.cu
#include <npp.h>
#include <iostream>

const int idim=32768;
const int s = idim*idim;

int main(){

  Npp8u *h, *d, *o;
  h = new Npp8u[s];
  cudaMalloc(&d, sizeof(d[0])*s);
  cudaMalloc(&o, sizeof(o[0])*s);
  for (int i = 0; i < s; i++) h[i] = i%3;
  cudaMemcpy(d, h, s, cudaMemcpyHostToDevice);
  NppiSize ns = {idim, idim};
  NppStatus stat = nppiTranspose_8u_C1R(d, idim, o, idim, ns);
  if (stat != NPP_SUCCESS) std::cout << "Npp error: " << (int)stat << std::endl;
  cudaMemcpy(h, o, s, cudaMemcpyDeviceToHost);
  for (int i = 0; i < idim; i++)
    for (int j = 0; j < idim; j++)
      if (h[j*idim+i] != (i*idim+j)%3) {
        // cast to int so the Npp8u value prints as a number rather than a character
        std::cout << "Mismatch at: " << j*idim+i << " was: " << (int)h[j*idim+i] << " should be: " << (i*idim+j)%3 << std::endl;
        return 0;
      }
  std::cout << "Success" << std::endl;
  return 0;
}

# nvcc -o t250 t250.cu -lnppidei
# compute-sanitizer ./t250
========= COMPUTE-SANITIZER
Success
========= ERROR SUMMARY: 0 errors
#

Hi Robert,
Thanks for your reply. I modified your code slightly for my case. It works fine when ydim is small, such as 1024, but fails for 1024*1024. Did I make any mistakes?
In my app I use an ROI of (4, 1336440) or even larger.

//# cat t250.cu
#include <npp.h>
#include <iostream>

const int xdim=4;
const int ydim=1024*1024;//2672840*8;
const int s = xdim*ydim*2;

int main(){

  Npp16s *h, *d, *o;
  h = new Npp16s[s];
  printf("size %fMib\n",s/1024.0/1024);
  cudaMalloc(&d, sizeof(d[0])*s);
  cudaMalloc(&o, sizeof(o[0])*s);
  for (int i = 0; i < s; i++) h[i] = i%3+1;
  cudaMemcpy(d, h, s, cudaMemcpyHostToDevice);
  NppiSize ns = {xdim, ydim};
  int xstep=2*xdim;
  int ystep=2*ydim;
  NppStatus stat = nppiTranspose_16s_C1R(d, xstep, o, ystep, ns);
  if (stat != NPP_SUCCESS) std::cout << "Npp error: " << (int)stat << std::endl;
  printf("---\n");
  cudaMemcpy(h, o, s, cudaMemcpyDeviceToHost);
  for (int i = 0; i < xdim; i++)
    for (int j = 0; j < ydim; j++)
    {
        int idx=i*ydim+j;
        int idx2=idx/xdim+idx%xdim*ydim;
      if (h[idx2] != (idx)%3+1)
      {
              printf(" %d(%d),i=%d,j=%d,%d(%d)\n",idx2,h[idx2],i,j,idx,idx%3);
              //std::cout << "Mismatch at: " << j*xdim+i << " was: " << h[j*xdim+i] << " should be: " << (i*xdim+j)%3 << std::endl; return 0;
      }
    }
  std::cout << "Success" << std::endl;
  return 0;
}

compute-sanitizer reports an error:

========= COMPUTE-SANITIZER
========= Program hit cudaErrorInvalidConfiguration (error 9) due to "invalid configuration argument" on CUDA API call to cudaLaunchKernel.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x4808d6]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x290b08]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libnppidei.so.12
=========     Host Frame:nppiTranspose_16s_C1R [0x21d1d9]
=========                in /usr/local/cuda/targets/x86_64-linux/lib/libnppidei.so.12
=========     Host Frame:main in /home/rda/XPAR23/mainb/t250.cu:21 [0x3a5f]
=========                in /home/rda/XPAR23/mainb/./t250
=========     Host Frame:__libc_start_main [0x236a3]
=========                in /lib64/libc.so.6
=========     Host Frame:_start [0x388e]
=========                in /home/rda/XPAR23/mainb/./t250
=========

Yes, there appears to be a limit on the y dimension that is some number greater than 1047552 but less than 1048576. (The maximum acceptable value for y appears to be (1024x1024)-16.) If this is a problem for you, you could file a bug. However, within the acceptable range of y, it seems that x can be much larger than 16 or 32, so there doesn’t seem to be a particular limit at 32 MiB as indicated in your original posting.

FWIW, I believe x has a larger limit than y, so I believe a transpose of (1572864,128) should be possible, for example.

Thank you very much, Robert. I have to transpose a (4, x) matrix where x may be very large in my project. For now I apply several coalesced transposes to avoid this problem.
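One possible shape for such a workaround is sketched below. It is written under assumptions and is not taken from the poster's code. The idea is to split the tall ROI into tiles of at most maxRows source rows and transpose each tile separately. Here transposeTile is a hypothetical CPU stand-in that takes the same pointer, step (in bytes) and ROI arguments as nppiTranspose_16s_C1R, so the offset arithmetic can be checked on the host. On the GPU each tile would instead be one nppiTranspose_16s_C1R call, with maxRows chosen at or below the limit observed above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical CPU stand-in for a single nppiTranspose_16s_C1R call:
// transposes a width x height ROI; both steps are row pitches in bytes.
void transposeTile(const std::int16_t* src, int srcStepBytes,
                   std::int16_t* dst, int dstStepBytes,
                   int width, int height) {
  const auto* srcBytes = reinterpret_cast<const unsigned char*>(src);
  auto* dstBytes = reinterpret_cast<unsigned char*>(dst);
  for (int r = 0; r < height; ++r) {
    const auto* srcRow = reinterpret_cast<const std::int16_t*>(
        srcBytes + static_cast<std::size_t>(r) * srcStepBytes);
    for (int c = 0; c < width; ++c) {
      auto* dstRow = reinterpret_cast<std::int16_t*>(
          dstBytes + static_cast<std::size_t>(c) * dstStepBytes);
      dstRow[r] = srcRow[c];  // source (row r, col c) goes to destination (row c, col r)
    }
  }
}

// Transposes a width x height image in tiles of at most maxRows source rows.
// maxRows must be positive.
void transposeInTiles(const std::int16_t* src, int srcStepBytes,
                      std::int16_t* dst, int dstStepBytes,
                      int width, int height, int maxRows) {
  for (int row0 = 0; row0 < height; row0 += maxRows) {
    const int rows = std::min(maxRows, height - row0);
    // The source tile starts row0 rows down; in the transposed image the
    // same tile starts row0 columns across, with the row pitch unchanged.
    const auto* srcTile = reinterpret_cast<const std::int16_t*>(
        reinterpret_cast<const unsigned char*>(src) +
        static_cast<std::size_t>(row0) * srcStepBytes);
    std::int16_t* dstTile = dst + row0;
    transposeTile(srcTile, srcStepBytes, dstTile, dstStepBytes, width, rows);
  }
}
```

Only the source pointer advances by whole rows. The destination pointer advances by elements within a row, and both steps stay the same for every tile.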

This is tracked as internal ticket 4807542. We will post the conclusion here when it is completed.