Why does this iterator cause lmem usage?

I have a simple iterator that I’m trying to pass into a scan function (as found in modernGPU or CUB) to do a scatter_if while avoiding writing the scan result back to memory.

For some reason it uses far more registers and local memory than a direct implementation of the same functionality. Can anybody point out to me why? I’ve tested with both CUDA 5.0 and 5.5.

The iterator version compiles to 20 registers and 48 bytes of lmem.

The direct implementation uses 5 registers and 0 bytes of lmem.

struct scatterIterator {
  char* m_pred;   // predicate array: nonzero means "scatter this element"
  int*  m_dst;    // destination array for the scattered values
  int   m_index;  // element this iterator instance is bound to

  // m_index is left uninitialized here; operator[] always supplies it
  // via the three-argument constructor.
  __host__ __device__
  scatterIterator(char* pred, int *dst) : m_pred(pred), m_dst(dst) {}

  __host__ __device__
  scatterIterator(char* pred, int *dst, int index) : m_pred(pred), m_dst(dst), m_index(index) {}

  // Indexing returns a new iterator bound to element i.
  __host__ __device__
  scatterIterator operator[](int i) const
  {
    return scatterIterator(m_pred, m_dst, i);
  }

  // Assignment performs the conditional scatter: if the predicate for
  // m_index is set, write m_index to the scanned output position 'dst'.
  __host__ __device__
  void operator =(int dst)
  {
    if (m_pred[m_index]) {
      m_dst[dst] = m_index;
    }
  }
};

template<typename OutputIt>
__global__
void testKernel1(OutputIt it) {
  it[threadIdx.x] = blockIdx.x;
  // Direct implementation of the same operation, enabled for the comparison build:
  //if (it.m_pred[threadIdx.x])
  //  it.m_dst[blockIdx.x] = threadIdx.x;
}

int main(void) {
  // NULL pointers are fine here; this is only a compile-time repro for
  // inspecting register/lmem usage, not meant to be run.
  scatterIterator output_begin(NULL, NULL);
  testKernel1<<<1, 1>>>(output_begin);
  return 0;
}
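
For reference, the direct implementation I’m comparing against is essentially the commented-out lines above written as their own kernel (a sketch; the kernel name testKernel2 is just for illustration):

__global__
void testKernel2(char* pred, int* dst) {
  // Same conditional scatter as scatterIterator::operator=, but done inline:
  if (pred[threadIdx.x])
    dst[blockIdx.x] = threadIdx.x;
}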

I might be misunderstanding your question, but your snippet compiles down to 4 registers:

nvcc -m 32 -arch sm_35 -Xptxas=-v -cubin iterator.cu
ptxas : info : 0 bytes gmem
ptxas : info : Compiling entry function '_Z11testKernel1I15scatterIteratorEvT_' for 'sm_35'
ptxas : info : Function properties for _Z11testKernel1I15scatterIteratorEvT_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 4 registers, 332 bytes cmem[0]

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Wed_Jul_10_13:36:45_PDT_2013
Cuda compilation tools, release 5.5, V5.5.0

Depending on how exactly the code is compiled (e.g. a debug build, or separate compilation), the lmem usage could be for the stack frame of a called function, and the higher register usage could be a side effect of the ABI function call interface. That’s just speculation at this point; there is not enough data to diagnose what is going on. Is the use of lmem or the additional registers actually causing problems?
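
If the ABI call overhead were indeed the culprit, one quick experiment (a sketch only, untested) would be to force the member functions to inline so that no call frame is needed; only the qualifiers change relative to the struct posted above:

  // Only the two operators are shown; the rest of scatterIterator is unchanged.
  __host__ __device__ __forceinline__
  scatterIterator operator[](int i) const
  {
    return scatterIterator(m_pred, m_dst, i);
  }

  __host__ __device__ __forceinline__
  void operator =(int dst)
  {
    if (m_pred[m_index]) {
      m_dst[dst] = m_index;
    }
  }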

@allanmac - apparently the problem is caused only when you specify -abi=no to ptxas, which is why your command doesn’t show any problem.

@njuffa - it causes a large slowdown relative to the performance I expect.

But interestingly, when I run the same command but compile for 64 bits, I get a very different result:

nvcc -arch=sm_35 -Xptxas -v -cubin itTest.cu 
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z11testKernel1I15scatterIteratorEvT_' for 'sm_35'
ptxas info    : Function properties for _Z11testKernel1I15scatterIteratorEvT_
    48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 19 registers, 344 bytes cmem[0]

Compiling for 32 bits solves the problem. No stack frame and the register usage goes down to 4.

Is that expected? I wouldn’t expect 32-bit vs. 64-bit compilation to make such a dramatic difference.

Odd, I still don’t see the problem on Win7/x64+VS2010:

nvcc -m 32 -arch sm_35 -Xptxas=-v -cubin iterator.cu
ptxas : info : Used 4 registers, 332 bytes cmem[0]

nvcc -m 32 -arch sm_35 -Xptxas=-v,-abi=no -cubin iterator.cu
ptxas : info : Used 3 registers, 332 bytes cmem[0]

nvcc -m 64 -arch sm_35 -Xptxas=-v -cubin iterator.cu
ptxas : info : Used 6 registers, 344 bytes cmem[0]

nvcc -m 64 -arch sm_35 -Xptxas=-v,-abi=no -cubin iterator.cu
ptxas : info : Used 4 registers, 344 bytes cmem[0]

ptxas version:

ptxas -V
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Wed_Jul_10_13:33:55_PDT_2013
Cuda compilation tools, release 5.5, V5.5.0

Hmmm… I’m on Linux. It seems odd that this would be platform specific. I guess I should file a bug, since this looks like one to me now.

Building code for a 64-bit platform forces all pointers (including those generated internally by the compiler, e.g. via strength reduction and induction variable creation) to be 64 bits wide, so it is not unusual for device code to use more registers when compiled for a 64-bit platform compared to a 32-bit build.
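
As a rough illustration (my own arithmetic assuming typical alignment, not compiler output), the iterator itself grows with the pointer width, and each pointer held in registers needs two 32-bit registers instead of one:

// scatterIterator holds two pointers and an int:
//   32-bit build: 4 + 4 + 4               = 12 bytes
//   64-bit build: 8 + 8 + 4 (+ 4 padding) = 24 bytes
printf("sizeof(scatterIterator) = %zu\n", sizeof(scatterIterator));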

In general, I highly recommend using the ABI (which is the compiler default and is only disabled by explicitly passing -abi=no). Are you getting warnings from the compiler when using -abi=no? I am not up to date on the current support status of compiling without the ABI.

Since you are already using CUDA 5.5, filing a bug seems like a good idea to get this sorted out. There should not be any differences between code generated on Windows and Linux, unless platform-specific type differences come into play which are inherited by CUDA for reasons of interoperability with host code (in particular, the type ‘long’ is 32 bits on 64-bit Windows but 64 bits on 64-bit Linux).
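
For example, this is the kind of host-code type difference meant here (a host-side check, nothing CUDA-specific):

#include <cstdio>
int main() {
  // 64-bit Windows (LLP64): sizeof(long) == 4
  // 64-bit Linux   (LP64):  sizeof(long) == 8
  printf("sizeof(long) = %zu\n", sizeof(long));
  return 0;
}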