NVCC compiling question: where is the lmem?

I have a CUDA application. When I compile it with the following command

nvcc -arch=sm_20 -maxrregcount 60 -Xptxas -v -I. -I…/…/…/src/include -c -O -o hash-table-gpu.o hash-table-gpu.cu

which allows a maximum of 60 registers per kernel, I get the following output:

ptxas info : Compiling entry function ‘_Z19compute_site_medianP4site’ for ‘sm_20’
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 8 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z19compute_host_medianP4site’ for ‘sm_20’
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z11flow_kernelP18rwGenericRec_V5_stlP4site5Table’ for ‘sm_20’
ptxas info : Used 47 registers, 4800+0 bytes smem, 88 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 12 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z11build_tableP4site5Table’ for ‘sm_20’
ptxas info : Used 16 registers, 72 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14]

The maximum number of registers used by any of the kernels is 47.

Then I limit the maximum number of registers per kernel to 20 with the command

nvcc -arch=sm_20 -maxrregcount 20 -Xptxas -v -I. -I…/…/…/src/include -c -O -o hash-table-gpu.o hash-table-gpu.cu

Here is the compile output:

ptxas info : Compiling entry function ‘_Z19compute_site_medianP4site’ for ‘sm_20’
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 8 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z19compute_host_medianP4site’ for ‘sm_20’
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z11flow_kernelP18rwGenericRec_V5_stlP4site5Table’ for ‘sm_20’
ptxas info : Used 20 registers, 4800+0 bytes smem, 88 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 12 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z11build_tableP4site5Table’ for ‘sm_20’
ptxas info : Used 16 registers, 72 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14]

Since the maximum number of registers is now limited to 20, there should be register spilling to local memory. But I do not see any “lmem” in the compile output above. I also used the Compute Visual Profiler to profile the CUDA application, and it did show operations on local memory. So why does “lmem” not appear in the compile output? Could anybody help me out?

My GPU is a Fermi 2050, NVCC is 3.2, and the system is Linux.

Thanks.
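As an aside, the register count and local memory that ptxas settles on for a kernel can also be checked at runtime with cudaFuncGetAttributes, independently of the -Xptxas -v output. Below is a minimal sketch using a made-up kernel my_kernel rather than any of the kernels from this thread; attr.localSizeBytes should reflect the per-thread local memory, though whether it folds in the Fermi stack frame may depend on the toolkit version.

#include <cstdio>
#include <cuda_runtime.h>

// Made-up kernel, standing in for a real one such as flow_kernel.
__global__ void my_kernel(float *out)
{
    out[threadIdx.x] = 2.0f * threadIdx.x;
}

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, my_kernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // numRegs and localSizeBytes are the runtime's view of what ptxas
    // reported as "registers" and "lmem" at compile time.
    printf("registers per thread    : %d\n", attr.numRegs);
    printf("local memory per thread : %lu bytes\n", (unsigned long)attr.localSizeBytes);
    printf("static shared memory    : %lu bytes\n", (unsigned long)attr.sharedSizeBytes);
    return 0;
}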

Is it possible to compile your code without the sm_20 option, but with the two different register settings? I’m curious if you aren’t seeing lmem explicitly listed due to the compiler putting these extra variables onto a function stack.

Thanks,

But I cannot drop the sm_20 option. I use some sm_20 features in the kernels, and without it the compilation fails.

I think seibert is correct here - the compiler is probably spilling to an internal heap on Fermi. The way the compiler reports memory usage might be a consequence of the “unified” memory space model Fermi uses. Have a look at this completely contrived example:

template<int rows>
__device__ void copy(const double *in, double *out)
{
    for (int i = 0; i < rows; i++) out[i] = in[i];
}

template<int rows>
__device__ double dotprod(const double *x, double *y)
{
    double result = 0.;
    for (int i = 0; i < rows; i++)
        result += x[i] * y[i];
    return result;
}

template<int rows>
__global__ void kernel(double *r, const double *a, const double *b, const double *c, const int N)
{
    double vec1[rows], vec2[rows];
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;  // global thread index
    int indx = rows * tidx;

    if (tidx < N) {
        copy<rows>(&a[indx], vec1);
        copy<rows>(&b[indx], vec2);
        double p1 = dotprod<rows>(vec1, vec2);

        copy<rows>(&b[indx], vec1);
        copy<rows>(&c[indx], vec2);
        double p2 = dotprod<rows>(vec1, vec2);

        copy<rows>(&c[indx], vec1);
        copy<rows>(&a[indx], vec2);
        double p3 = dotprod<rows>(vec1, vec2);

        r[tidx] = max(p1, max(p2, p3));
    }
}

template void kernel<4>(double *, const double *, const double *, const double *, int);
template void kernel<16>(double *, const double *, const double *, const double *, int);
template void kernel<64>(double *, const double *, const double *, const double *, int);
template void kernel<256>(double *, const double *, const double *, const double *, int);
template void kernel<1024>(double *, const double *, const double *, const double *, int);

This just computes some dot products in thread-local memory. Compiling it for compute 1.3 gives this:

avidday@cuda:~$ nvcc -c -arch=sm_13 -Xptxas="-v" spillage.cu
ptxas info    : Compiling entry function '_Z6kernelILi1024EEvPdPKdS2_S2_i' for 'sm_13'
ptxas info    : Used 12 registers, 16384+0 bytes lmem, 36+16 bytes smem, 4 bytes cmem[1]
ptxas info    : Compiling entry function '_Z6kernelILi256EEvPdPKdS2_S2_i' for 'sm_13'
ptxas info    : Used 12 registers, 4096+0 bytes lmem, 36+16 bytes smem, 4 bytes cmem[1]
ptxas info    : Compiling entry function '_Z6kernelILi64EEvPdPKdS2_S2_i' for 'sm_13'
ptxas info    : Used 12 registers, 1024+0 bytes lmem, 36+16 bytes smem, 4 bytes cmem[1]
ptxas info    : Compiling entry function '_Z6kernelILi16EEvPdPKdS2_S2_i' for 'sm_13'
ptxas info    : Used 16 registers, 36+16 bytes smem
ptxas info    : Compiling entry function '_Z6kernelILi4EEvPdPKdS2_S2_i' for 'sm_13'
ptxas info    : Used 28 registers, 36+16 bytes smem

which is what you would expect - lots of local memory usage at the big array sizes. But for compute 2.0 it looks very different:

avidday@cuda:~$ nvcc -c -arch=sm_20 -Xptxas="-v" spillage.cu
ptxas info    : Compiling entry function '_Z6kernelILi1024EEvPdPKdS2_S2_i' for 'sm_20'
ptxas info    : Used 17 registers, 8+0 bytes lmem, 68 bytes cmem[0], 4 bytes cmem[16]
ptxas info    : Compiling entry function '_Z6kernelILi256EEvPdPKdS2_S2_i' for 'sm_20'
ptxas info    : Used 30 registers, 8+0 bytes lmem, 68 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi64EEvPdPKdS2_S2_i' for 'sm_20'
ptxas info    : Used 23 registers, 8+0 bytes lmem, 68 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi16EEvPdPKdS2_S2_i' for 'sm_20'
ptxas info    : Used 63 registers, 68 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi4EEvPdPKdS2_S2_i' for 'sm_20'
ptxas info    : Used 27 registers, 68 bytes cmem[0]

i.e. no big local memory allocations at all. Without a disassembler it is pretty hard to say exactly what is going on, but it would seem that the arrays are going to the heap. Heap memory transactions probably still profile as local memory, because they are still per-thread, off-chip memory, but the way the compiler reports them looks different.
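For completeness, here is a sketch of how the per-thread stack behind those spilled arrays can be inspected and resized with the CUDA 3.x runtime API (cudaThreadGetLimit/cudaThreadSetLimit with cudaLimitStackSize; newer toolkits rename these to cudaDeviceGetLimit/cudaDeviceSetLimit). The 2048-byte value below is purely illustrative, not a recommendation.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stack_bytes = 0;

    // On Fermi, local arrays and register spills live on a per-thread
    // stack in off-chip (local) memory; its size is a runtime limit.
    cudaThreadGetLimit(&stack_bytes, cudaLimitStackSize);
    printf("default per-thread stack size: %lu bytes\n", (unsigned long)stack_bytes);

    // Kernels with big local arrays may need more stack; 2048 bytes is
    // only an example value.
    cudaThreadSetLimit(cudaLimitStackSize, 2048);

    cudaThreadGetLimit(&stack_bytes, cudaLimitStackSize);
    printf("per-thread stack size now    : %lu bytes\n", (unsigned long)stack_bytes);
    return 0;
}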

Thanks, now it makes sense to me.

And just to round things out, it seems that the nvcc version shipping in the 4.0rc release reports things much more explicitly:

avidday@cuda:~$ nvcc -c -arch=sm_20 -Xptxas="-v" spillage.cu
ptxas info    : Compiling entry function '_Z6kernelILi1024EEvPdPKdS2_S2_i' for 'sm_20'
ptxas info    : Function properties for _Z6kernelILi1024EEvPdPKdS2_S2_i
    16384 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 17 registers, 8+0 bytes lmem, 68 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi256EEvPdPKdS2_S2_i' for 'sm_20'
ptxas info    : Function properties for _Z6kernelILi256EEvPdPKdS2_S2_i
    4096 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 22 registers, 8+0 bytes lmem, 68 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi64EEvPdPKdS2_S2_i' for 'sm_20'
ptxas info    : Function properties for _Z6kernelILi64EEvPdPKdS2_S2_i
    1024 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 25 registers, 8+0 bytes lmem, 68 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi16EEvPdPKdS2_S2_i' for 'sm_20'
ptxas info    : Function properties for _Z6kernelILi16EEvPdPKdS2_S2_i
    64 bytes stack frame, 64 bytes spill stores, 64 bytes spill loads
ptxas info    : Used 63 registers, 68 bytes cmem[0]
ptxas info    : Compiling entry function '_Z6kernelILi4EEvPdPKdS2_S2_i' for 'sm_20'
ptxas info    : Function properties for _Z6kernelILi4EEvPdPKdS2_S2_i
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 27 registers, 68 bytes cmem[0]

so it is easy to see that the register spills are going into heap/stack space (the per-thread stack frame) for compute 2.0 targets.
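Coming back to the original -maxrregcount experiment: a per-kernel alternative for capping register pressure is the __launch_bounds__ qualifier, which lets ptxas derive a register limit for one kernel instead of the whole compilation unit. A minimal sketch with illustrative bounds (256 threads per block, at least 4 resident blocks per SM), not taken from the code in this thread:

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells
// ptxas the launch configuration to optimise for; it then caps register
// usage per thread so that occupancy is achievable, and any excess spills
// to local memory / the stack frame exactly as discussed above.
__global__ void __launch_bounds__(256, 4)
capped_kernel(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0 * in[i];   // trivial body, just something to compile
}

Compiling with -Xptxas -v should then show the register count and any stack frame/lmem for this kernel alone, without forcing a low register limit onto every other kernel in the file.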