ptxas info unexplained - what is cmem[n]?

I have the following ptxas info output from the compiler:

ptxas info : Compiling entry function '_Z12compute_dIdtPfS_S_S_S_S_llllfff' for 'sm_21'
ptxas info : Used 38 registers, 124 bytes cmem[0], 348 bytes cmem[2], 4 bytes cmem[16]
ptxas info : Compiling entry function '_Z11convgauss_2PfS_S_S_llll' for 'sm_21'
ptxas info : Used 26 registers, 96 bytes cmem[0], 348 bytes cmem[2]
ptxas info : Compiling entry function '_Z10compute_CDPfS_S_S_llfff' for 'sm_21'
ptxas info : Used 19 registers, 4+0 bytes lmem, 92 bytes cmem[0], 348 bytes cmem[2], 44 bytes cmem[16]
ptxas info : Compiling entry function '_Z11convgauss_1PfS_llll' for 'sm_21'
ptxas info : Used 20 registers, 80 bytes cmem[0], 348 bytes cmem[2]

However, the nvcc manual doesn't explain what cmem[n] is.
It does say that:

“Used constant memory is partitioned in constant program ‘variables’ (bank 1), plus
compiler generated constants (bank 14).”

referring to cmem[1] and cmem[14], but nothing more is said.

So my question is: what are cmem[0], cmem[2], and cmem[16], or any other cmem[n]?

And does it refer to constant memory per kernel, per thread, per CUDA core, or per block?

Best I can tell from looking at disassembled machine code, for sm_2x devices the constant banks appear to be assigned as follows:

cmem[0] kernel arguments
cmem[2] user defined constant objects
cmem[16] compiler generated constants (some of which may correspond to literal constants in the source code)
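
For illustration, a toy kernel along these lines (hypothetical code, untested) compiled with nvcc -arch=sm_21 --ptxas-options=-v should show that pattern: the kernel parameters counted under cmem[0], the __constant__ array under cmem[2], and possibly a compiler-generated constant under cmem[16]:

__constant__ float coeffs[64];                      // user-defined __constant__ object -> cmem[2]

__global__ void scale(float *out, const float *in,  // kernel arguments -> cmem[0]
                      int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[i & 63] * 1.2345f;  // the literal may end up in cmem[16]
}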

As the discrepancy with the documented assignments indicates, these are internal implementation details that can and will change over time. I will contact the toolchain team about a refresh of the documentation in this regard. Thank you for bringing this to our attention.

Thank you for your answer.

This raises another question, since, as I understand it, cmem can't be responsible for my CUDA error message (too many resources requested for launch).

I made a method to compute the number of threads based on:
- maximum number of registers per block
- maximum number of threads
- maximum block dimensions
- maximum grid dimension

and I made sure that none of these is violated (a rough sketch of the kind of check is below).
Is there any other possible source for the error?

As you can see from my ptxas info, I’m not using any shared memory.
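
Roughly, the check is of this kind (a simplified sketch of the idea, not the exact code; error checking omitted):

#include <cuda_runtime.h>

// Returns true if a launch of 'grid' x 'block' with 'regsPerThread'
// registers per thread stays within the device limits.
bool launchFits(dim3 grid, dim3 block, int regsPerThread)
{
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);

    int threadsPerBlock = block.x * block.y * block.z;

    bool ok = threadsPerBlock <= p.maxThreadsPerBlock;               // max threads per block
    ok = ok && threadsPerBlock * regsPerThread <= p.regsPerBlock;    // registers per block
    ok = ok && block.x <= (unsigned)p.maxThreadsDim[0]               // block dimensions
            && block.y <= (unsigned)p.maxThreadsDim[1]
            && block.z <= (unsigned)p.maxThreadsDim[2];
    ok = ok && grid.x <= (unsigned)p.maxGridSize[0]                  // grid dimensions
            && grid.y <= (unsigned)p.maxGridSize[1]
            && grid.z <= (unsigned)p.maxGridSize[2];
    return ok;
}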

If you are getting that at launch, and not using shared memory, then it must be registers. The only other issue is excessive local memory, but that will generate a compile-time error, not a runtime error. Which of those kernels above is failing, and what block size are you using when it fails?

It does not give any compile-time error.

The one giving the error is the kernel with 38 registers, and the block size is 4x4x50.

38 * (4*4*50) = 30400 < 32768

so no problem there (I think)

Register usage looks OK too. Given this is on Fermi, a wild guess would be to try playing with cudaThreadSetLimit/cudaThreadGetLimit; it might be a lack of heap or printf FIFO space. You might also want to check how much free device memory is available. Otherwise I can't see anything obviously wrong based on what you have posted.
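
For example, something along these lines (untested sketch) prints the current limits and the free memory; the cudaLimit values are the standard stack / printf FIFO / malloc heap ones:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stack = 0, fifo = 0, heap = 0, freeMem = 0, totalMem = 0;

    // Current per-thread stack, printf FIFO and device-side malloc heap limits
    cudaThreadGetLimit(&stack, cudaLimitStackSize);
    cudaThreadGetLimit(&fifo,  cudaLimitPrintfFifoSize);
    cudaThreadGetLimit(&heap,  cudaLimitMallocHeapSize);

    // Free vs. total device memory
    cudaMemGetInfo(&freeMem, &totalMem);

    printf("stack %zu  printf fifo %zu  heap %zu  free %zu / total %zu bytes\n",
           stack, fifo, heap, freeMem, totalMem);

    // e.g. raise the device malloc heap before the first kernel launch:
    // cudaThreadSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    return 0;
}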

Yes, I’m using a Fermi card (GTX 460)

Using the occupancy calculator, I found that with these values for block size, registers, and shared memory there was a RED flag on these:

Maximum Thread Blocks Per Multiprocessor (Blocks):
  Limited by Max Warps / Blocks per Multiprocessor: 1
  Limited by Registers per Multiprocessor: 1

Could that be the problem? How can I overcome this?

That doesn’t mean there is a problem. If the “Limited by Max Warps / Blocks per Multiprocessor” were 0, it would indicate you had too many threads per block for the number of registers per thread the kernel requires.
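
For these numbers it works out roughly like this (assuming sm_21 limits of 32768 registers and 48 resident warps per multiprocessor, and ignoring register allocation granularity):

4 x 4 x 50 = 800 threads/block = 25 warps/block, and floor(48 / 25) = 1 block limited by warps
38 registers/thread * 800 threads = 30400 registers/block, and floor(32768 / 30400) = 1 block limited by registers

So both limits allow exactly one resident block per multiprocessor: low occupancy, but still a legal launch.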

I think you exceeded the maximum number of threads: 4x4x50 = 800 > 768.

Actually, the maximum number of threads per block for my card is 1024.