Where can I find information regarding the --ptxas-options=-v output?

Does the NVIDIA CUDA documentation contain any detailed information about the above command’s output? If yes, kindly point me to the link, because I couldn’t find this information.

I have some questions regarding the following output:

ptxas info    : 1024 bytes gmem, 8 bytes cmem[14]
ptxas info    : Function properties for _Z9CRC32_NEWjh
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

The above is the output for this file:

#include "include/crc32_new.h"

static  __device__ uint32_t crc32_tab[256] = {
	0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f,
	0xe963a535, 0x9e6495a3,	0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988,
	0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2,
	0xf3b97148, 0x84be41de,	0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7,
	0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec,	0x14015c4f, 0x63066cd9,
	0xfa0f3d63, 0x8d080df5,	0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172,
	0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b,	0x35b5a8fa, 0x42b2986c,
	0xdbbbc9d6, 0xacbcf940,	0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59,
	0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423,
	0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924,
	0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d,	0x76dc4190, 0x01db7106,
	0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433,
	0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d,
	0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e,
	0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950,
	0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65,
	0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7,
	0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0,
	0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa,
	0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f,
	0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81,
	0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a,
	0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84,
	0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1,
	0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
	0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc,
	0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e,
	0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b,
	0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55,
	0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236,
	0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28,
	0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d,
	0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f,
	0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38,
	0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242,
	0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777,
	0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69,
	0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2,
	0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc,
	0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9,
	0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693,
	0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94,
	0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
};


__host__ __device__ uint32_t CRC32_NEW(uint32_t incrc, uint8_t b)
{
    return crc32_tab[(incrc ^ b) & 0xFF] ^ (incrc >> 8);
}
  1. It is obvious that gmem is 1024 bytes (256 entries × 4 bytes). Why is constant memory 8 bytes in the above case, and what is meant by the number [14]?

  2. I have a project with multiple .cu files and would like to optimize it. While considering optimization, I noticed that the ptxas information is different for kernel.cu, main.cu, and crc32_new.cu. How should I interpret these values?

The ptxas info for kernel.cu and main.cu is the following:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z28MyKernelhS0_S0_PK3CDRS0_Ph' for 'sm_20'
ptxas info    : Function properties for _Z28MyKernelhS0_S0_PK3CDRS0_Ph
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 29 registers, 80 bytes cmem[0], 4 bytes cmem[16]
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z14float_to_colorPhPKf' for 'sm_20'
ptxas info    : Function properties for _Z14float_to_colorPhPKf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 20 registers, 48 bytes cmem[0], 4 bytes cmem[16]
ptxas info    : Function properties for _Z5valueffi
    8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
ptxas info    : Compiling entry function '_Z14float_to_colorP6uchar4PKf' for 'sm_20'
ptxas info    : Function properties for _Z14float_to_colorP6uchar4PKf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 21 registers, 48 bytes cmem[0], 4 bytes cmem[16]

Constant memory, referred to as cmem in the above, is organized into various banks, which are identified by subscript. The banks available, their size, and their functional assignment are generally architecture specific. In other words, these are implementation artifacts.

If you spend enough time looking at the output of cuobjdump --dump-sass, you should be able to reverse engineer the usage for specific constant banks on a particular architecture: kernel arguments, kernel configuration parameters, programmer defined constant data, compiler generated constant data, etc.
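For illustration, here is a minimal sketch (the kernel and variable names are invented for this example) showing three of those categories side by side; which bank number each one lands in depends on the target architecture:

#include <stdint.h>

// Programmer-defined constant data: ends up in its own cmem bank.
__constant__ uint32_t lut[64];

// The kernel arguments (out, in, mask) travel through a dedicated
// constant bank, and compiler-generated constants may occupy yet
// another one. (Bounds checks omitted for brevity.)
__global__ void combine(uint32_t *out, const uint32_t *in, uint32_t mask)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = (in[i] ^ mask) + lut[i & 63];
}

Compiling something like this with --ptxas-options=-v and comparing against cuobjdump --dump-sass for your target is exactly the kind of exercise that reveals the bank assignments.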

If the impression I gather from your posts in these forums is accurate, you are an early-stage CUDA learner. If that assessment is correct, I would suggest ignoring this ptxas output for now and relying on the CUDA profiler to point you at the bottlenecks in your code.


You are right. I am a noob in CUDA and would like to explore it with your help.

The profiler can point out the bottlenecks, but I would like to know the number of registers, the amount of shared memory, and the other resources used by my program in order to tune various parameters in my project. I have built a project and am now considering the possibilities to optimize it.

I’m confused by the different values for the same parameter in my ptxas information.
E.g., as you can see from my post above, the ptxas info for kernel.cu shows that it uses 29 registers, while the ptxas info for main.cu shows different values for the registers consumed. Which values should I consider?


The number of registers per thread actually used is reported on a per-kernel-function basis. This becomes complicated when you have device functions defined in multiple compilation units that are called across compilation-unit boundaries. In that case the compiler (sometimes/usually) determines the registers-per-thread usage on a per-function basis, and you have to combine them yourself.
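As a minimal sketch of that cross-compilation-unit case (the file and function names are invented; this setup requires relocatable device code, e.g. nvcc -rdc=true):

// helper.cu -- device function compiled in its own unit
__device__ int helper(int x)
{
    return x * x + 1;
}

// kernel.cu -- kernel in another unit calling the external device function
extern __device__ int helper(int x);

__global__ void myKernel(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Resolved at device link time; ptxas reports register usage for
    // helper() and myKernel() separately, one entry per compilation unit.
    out[i] = helper(i);
}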

You get different outputs for main.cu and kernel.cu because those two files have different functions defined within them. The functions in question are identified by their C++ mangled names. (Please google C++ name mangling if you don’t know what that is.)
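For example, the mangled names in the ptxas output above demangle as follows (you can also feed them through c++filt):

_Z9CRC32_NEWjh                 ->  CRC32_NEW(unsigned int, unsigned char)
_Z14float_to_colorPhPKf        ->  float_to_color(unsigned char*, float const*)
_Z14float_to_colorP6uchar4PKf  ->  float_to_color(uchar4*, float const*)
_Z5valueffi                    ->  value(float, float, int)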

In one module you have something like MyKernel defined.

In another module you have something like two different versions of float_to_color defined.

(This would be a lot easier to explain if you provided a short, complete example.)
The registers-per-thread computation is done on a per-function, per-compilation-unit basis. As I said already, you would have to combine these sensibly. Alternatively, the profiler will report the registers per thread used by an actual kernel launch, which may be easier when working in this kind of multi-module, device-linked code environment.

The rules for combining the registers-per-thread usage of a kernel that calls multiple device functions can become fairly complex or obscure. Therefore, in complex scenarios, I would encourage you to start your analysis of the impact of registers per thread on performance/optimization with the profiler rather than the ptxas output, pretty much as njuffa already suggested. The ptxas output is most useful for relative comparisons when making code changes, to see how they affect registers-per-thread usage.

Note that the profiler will also tell you how much shared memory is used/required by a kernel call.
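As another option, you can query these numbers programmatically at run time with cudaFuncGetAttributes. A minimal sketch (the kernel here is just a placeholder, and error checking is omitted):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void dummyKernel(float *out)
{
    out[threadIdx.x] = (float)threadIdx.x;
}

int main()
{
    // Reports the same per-kernel resource usage that ptxas prints at
    // compile time, but for the binary actually loaded on this device.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, dummyKernel);
    printf("registers per thread : %d\n", attr.numRegs);
    printf("static shared memory : %zu bytes\n", attr.sharedSizeBytes);
    printf("local (stack) memory : %zu bytes\n", attr.localSizeBytes);
    printf("constant memory      : %zu bytes\n", attr.constSizeBytes);
    return 0;
}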


Thank you, that was very helpful.


I know spilling is bad. But how about cmem? Is less better? Thanks!

cmem is constant memory. It may be used explicitly in a program, e.g. when you declare a variable with __constant__. It is also used implicitly to pass kernel parameters from host to device during the kernel launch process.
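To illustrate both cases in one place, a small sketch (names invented):

// Explicit use: this variable lives in constant memory.
__constant__ float gain;

__global__ void amplify(float *data)
{
    // The kernel parameter 'data' is the implicit use: it reaches
    // the device through a constant bank as well.
    data[threadIdx.x] *= gain;
}

// Host side, before the launch:
//     float h_gain = 2.0f;
//     cudaMemcpyToSymbol(gain, &h_gain, sizeof(h_gain));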

I don’t know why it would be “bad” to use cmem.


Well, you mentioned that kernel parameters are implicitly passed through constant memory. So if that is slow, maybe we can somehow reduce this implicit usage…

Unless you define a kernel with zero parameters, constant memory will be used to pass kernel arguments, so I don’t think avoiding it is a sensible or practical idea. If you are passing a large amount of kernel parameter data, then there could be an impact, but in my experience it is not a useful optimization target.
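For context, everything in the parameter list counts as kernel argument data, including structs passed by value. A hedged sketch (types and names invented):

struct Params {        // passed by value: the whole struct becomes
    const float *in;   // kernel-argument data in constant memory
    float       *out;
    int          n;
    float        scale;
};

__global__ void apply(Params p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n)
        p.out[i] = p.in[i] * p.scale;
}

Even a struct like this is only a few dozen bytes, which is why kernel argument traffic rarely matters in practice.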

Constant memory is cached, and typically the same value is read by all lanes/threads of a warp. Constant memory has performance similar to loading the program’s instructions. If the same constants are loaded repeatedly within a loop, the cost of loading them hardly matters. If they are used only once and you do not have enough threads running at the same time, latency issues can arise.
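A small sketch of the uniform-versus-divergent access pattern described above (names invented, bounds checks omitted):

__constant__ float coeffs[256];

__global__ void uniform_vs_divergent(float *out, const float *in, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Uniform: every thread of a warp reads coeffs[k], a single
    // broadcast from the constant cache -- the fast case.
    float a = in[i] * coeffs[k];
    // Divergent: threads of one warp read different entries, so the
    // distinct addresses have to be served one after another.
    float b = in[i] * coeffs[i & 255];
    out[i] = a + b;
}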