Where can I find information regarding the --ptxas-options=-v output?

Does the NVIDIA CUDA documentation contain any detailed information about the above command’s output? If yes, kindly point me to the link, because I couldn’t find this information.

I have some questions regarding the following output:

ptxas info    : 1024 bytes gmem, 8 bytes cmem[14]
ptxas info    : Function properties for _Z9CRC32_NEWjh
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

The above is the output for this file:

#include "include/crc32_new.h"

static  __device__ uint32_t crc32_tab[256] = {
	0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
	0xe963a535, 0x9e6495a3,	0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
	0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,
	0xf3b97148, 0x84be41de,	0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
	0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec,	0x14015c4f, 0x63066cd9,
	0xfa0f3d63, 0x8d080df5,	0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
	0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b,	0x35b5a8fa, 0x42b2986c,
	0xdbbbc9d6, 0xacbcf940,	0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
	0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,
	0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
	0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d,	0x76dc4190, 0x01db7106,
	0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
	0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,
	0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
	0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,
	0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
	0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,
	0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
	0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,
	0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
	0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,
	0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
	0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,
	0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
	0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
	0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
	0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,
	0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
	0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,
	0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
	0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,
	0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
	0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,
	0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
	0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,
	0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
	0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,
	0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
	0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,
	0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
	0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,
	0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
	0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
};


__host__ __device__ uint32_t CRC32_NEW(uint32_t incrc, uint8_t b)
{
    return crc32_tab[(incrc ^ b) & 0xFF] ^ (incrc >> 8);
}
  1. It is obvious that gmem is 1024 bytes (256 entries × 4 bytes). Why is constant memory 8 bytes in the above case, and what is meant by the number [14]?

  2. I have a project with multiple .cu files and would like to optimize it. While considering optimization, I noticed that the ptxas information is different for kernel.cu, main.cu, and crc32_new.cu. How should I interpret these values?

The ptxas info for kernel.cu and main.cu is the following:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z28MyKernelhS0_S0_PK3CDRS0_Ph' for 'sm_20'
ptxas info    : Function properties for _Z28MyKernelhS0_S0_PK3CDRS0_Ph
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 29 registers, 80 bytes cmem[0], 4 bytes cmem[16]
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z14float_to_colorPhPKf' for 'sm_20'
ptxas info    : Function properties for _Z14float_to_colorPhPKf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 20 registers, 48 bytes cmem[0], 4 bytes cmem[16]
ptxas info    : Function properties for _Z5valueffi
    8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
ptxas info    : Compiling entry function '_Z14float_to_colorP6uchar4PKf' for 'sm_20'
ptxas info    : Function properties for _Z14float_to_colorP6uchar4PKf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 21 registers, 48 bytes cmem[0], 4 bytes cmem[16]

Constant memory, referred to as cmem in the above, is organized into various banks, which are identified by subscript. The banks available, their size, and their functional assignment are generally architecture specific. In other words, these are implementation artifacts.

If you spend enough time looking at the output of cuobjdump --dump-sass, you should be able to reverse engineer the usage for specific constant banks on a particular architecture: kernel arguments, kernel configuration parameters, programmer defined constant data, compiler generated constant data, etc.
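For illustration, here is a minimal sketch (the kernel and variable names are invented for this example) showing three of those categories side by side; which bank number each one lands in depends on the target architecture:

#include <stdint.h>

// Programmer-defined constant data: ends up in its own cmem bank.
__constant__ uint32_t lut[64];

// The kernel arguments (out, in, mask) travel through a dedicated
// constant bank, and compiler-generated constants may occupy yet
// another one. (Bounds checks omitted for brevity.)
__global__ void combine(uint32_t *out, const uint32_t *in, uint32_t mask)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (in[i] ^ mask) + lut[i & 63];
}

Compiling something like this with --ptxas-options=-v and comparing against cuobjdump --dump-sass for your target is exactly the kind of exercise that reveals the bank assignments.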

If the impression I gather from your posts in these forums is accurate, you are an early-stage CUDA learner. If that assessment is correct, I would suggest ignoring this ptxas output for now and relying on the CUDA profiler to point you at the bottlenecks in your code.


You are right. I am a noob in CUDA and would like to explore it with your help.

The profiler can point out the bottlenecks, but I would like to know the number of registers, the amount of shared memory, and the other resources used by my program in order to tune various parameters in my project. I have built a project and am now considering the possibilities to optimize it.

I’m confused by the different values for the same parameter in my ptxas information.
E.g., as you can see from my post above, the ptxas info for kernel.cu shows that it uses 29 registers, while the ptxas info for main.cu shows different values for the registers consumed. Which values should I consider?


The number of registers per thread actually used is reported on a per-kernel-function basis. This becomes complicated when you have device functions defined in multiple compilation units that are called across compilation-unit boundaries. In that case the compiler (sometimes/usually) determines the registers-per-thread usage on a per-function basis, and you have to combine them yourself.
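As a minimal sketch of that cross-compilation-unit case (the file and function names are invented; this setup requires relocatable device code, e.g. nvcc -rdc=true):

// helper.cu -- device function compiled in its own unit
__device__ int helper(int x)
{
    return x * x + 1;
}

// kernel.cu -- kernel in another unit calling the external device function
extern __device__ int helper(int x);

__global__ void myKernel(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Resolved at device link time; ptxas reports register usage for
    // helper() and myKernel() separately, one entry per compilation unit.
    out[i] = helper(i);
}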

You get different outputs for main.cu and kernel.cu because those two files have different functions defined within them. The functions in question are identified by their C++ mangled names. (Please google C++ name mangling if you don’t know what that is.)
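For example, the mangled names in the ptxas output above demangle as follows (you can also feed them through c++filt):

_Z9CRC32_NEWjh                 ->  CRC32_NEW(unsigned int, unsigned char)
_Z14float_to_colorPhPKf        ->  float_to_color(unsigned char*, float const*)
_Z14float_to_colorP6uchar4PKf  ->  float_to_color(uchar4*, float const*)
_Z5valueffi                    ->  value(float, float, int)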

In one module you have something like MyKernel defined.

In another module you have something like two different versions of float_to_color defined.

(This would be a lot easier to explain if you provided a short, complete example.)
The registers-per-thread computation is done on a per-function, per-compilation-unit basis. As I said already, you would have to combine these sensibly. Alternatively, the profiler will report the registers per thread used by an actual kernel launch, which may be easier when working in this kind of multi-module, device-linked code environment.

The rules for combining the registers-per-thread usage of a kernel that calls multiple device functions can become fairly complex or obscure. Therefore, in complex scenarios, I would encourage you to start your analysis of the impact of registers per thread on performance/optimization with the profiler rather than the ptxas output, pretty much as njuffa already suggested. The ptxas output is most useful for relative comparisons when making code changes, to see how they affect registers-per-thread usage.

Note that the profiler will also tell you how much shared memory is used/required by a kernel call.
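As another option, you can query these numbers programmatically at run time with cudaFuncGetAttributes. A minimal sketch (the kernel here is just a placeholder, and error checking is omitted):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void dummyKernel(float *out)
{
    out[threadIdx.x] = (float)threadIdx.x;
}

int main()
{
    // Reports the same per-kernel resource usage that ptxas prints at
    // compile time, but for the binary actually loaded on this device.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, dummyKernel);
    printf("registers per thread : %d\n", attr.numRegs);
    printf("static shared memory : %zu bytes\n", attr.sharedSizeBytes);
    printf("local (stack) memory : %zu bytes\n", attr.localSizeBytes);
    printf("constant memory      : %zu bytes\n", attr.constSizeBytes);
    return 0;
}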


Thank you, that was very helpful.


I know spilling is bad. But how about cmem? Is less better? Thanks!

cmem is constant memory. It may be used explicitly in a program, e.g. when you declare a variable with __constant__. It is also used implicitly to pass kernel parameters from host to device during the kernel launch process.
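To illustrate both cases in one place, a small sketch (names invented):

// Explicit use: this variable lives in constant memory.
__constant__ float gain;

__global__ void amplify(float *data)
{
    // The kernel parameter 'data' is the implicit use: it reaches
    // the device through a constant bank as well.
    data[threadIdx.x] *= gain;
}

// Host side, before the launch:
//     float h_gain = 2.0f;
//     cudaMemcpyToSymbol(gain, &h_gain, sizeof(h_gain));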

I don’t know why it would be “bad” to use cmem.


Well, you mentioned that kernel parameters are implicitly passed through constant memory. So if that is slow, maybe we can somehow reduce this implicit usage…

Unless you define a kernel with zero parameters, constant memory will be used to pass kernel arguments, so I don’t think avoiding it is a sensible or practical idea. If you are passing a large amount of kernel parameter data, then there could be an impact, but in my experience it is not a useful optimization target.
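For context, everything in the parameter list counts as kernel argument data, including structs passed by value. A hedged sketch (types and names invented):

struct Params {        // passed by value: the whole struct becomes
    const float *in;   // kernel-argument data in constant memory
    float       *out;
    int          n;
    float        scale;
};

__global__ void apply(Params p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n)
        p.out[i] = p.in[i] * p.scale;
}

Even a struct like this is only a few dozen bytes, which is why kernel argument traffic rarely matters in practice.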

Constant memory is cached, and typically the same value is read by all lanes/threads of a warp. Constant memory has performance similar to loading the program’s instructions. If the same constants are loaded repeatedly within a loop, the cost of loading them hardly matters. If they are used only once and you do not have enough threads running at the same time, latency issues can arise.
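A small sketch of the uniform-versus-divergent access pattern described above (names invented, bounds checks omitted):

__constant__ float coeffs[256];

__global__ void uniform_vs_divergent(float *out, const float *in, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Uniform: every thread of a warp reads coeffs[k], a single
    // broadcast from the constant cache -- the fast case.
    float a = in[i] * coeffs[k];
    // Divergent: threads of one warp read different entries, so the
    // distinct addresses have to be served one after another.
    float b = in[i] * coeffs[i & 255];
    out[i] = a + b;
}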