Understanding the number of registers

I compiled with the --cubin option and read the per-thread register count from the output. I then counted the register variables declared in my kernel, and the two numbers roughly match.

However, my code has many places where the right-hand side (RHS) requires several floating-point operations, such as

ctheta=tmp0*cphi*cphi+tmp1*len;

I imagine these operations also need temporary registers to hold intermediate results while the RHS is evaluated. My question is: does the register count reported in the cubin file include these temporary registers? In other words, do these temporaries count against the 8192-register limit?
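As far as I understand, the count in the cubin is what ptxas allocated for the whole compiled kernel, so temporaries for intermediate results should already be included in it. Below is a rough host-side sketch of how that per-thread count relates to the block size. It takes the 8192 figure from the question, assumes a 512-thread block limit (compute capability 1.x), and ignores the allocation granularity that real hardware applies. The function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Assumed limits for a compute capability 1.0/1.1 device.
// 8192 is the figure quoted in the question; 512 is the assumed block-size cap.
const int kRegistersPerMultiprocessor = 8192;
const int kMaxThreadsPerBlock = 512;

// Largest block size whose threads all fit in the register file.
// Simplification: real hardware allocates registers in coarser units.
int maxThreadsPerBlock(int registersPerThread) {
    assert(registersPerThread > 0);
    int byRegisters = kRegistersPerMultiprocessor / registersPerThread;
    return std::min(byRegisters, kMaxThreadsPerBlock);
}

// True if a block of the given size fits in the register budget.
bool blockFits(int registersPerThread, int threadsPerBlock) {
    return threadsPerBlock <= maxThreadsPerBlock(registersPerThread);
}
```

Under these assumptions, a kernel using 32 registers per thread would be limited to 256 threads per block, so blockFits(32, 512) is false.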

The reason I ask is that I get a “the launch timed out and was terminated” error when I set the number of threads to a larger value. From what I found online, I think this is related to the register limit. Does anyone want to share their experience with this?

Thank you in advance.

One additional question: how can I get a cubin file when I use atomic operations in the .cu file?

nvcc --cubin -arch compute_11 -DFAST_MATH mcextreme.cu

When I use the above command, nvcc complains that

nvcc fatal   : Option '-cubin' is not allowed when compiling for a virtual compute architecture

But if I drop -arch compute_11, nvcc complains that atomicXXX is not defined :(

As far as I know, the “the launch timed out and was terminated” error occurs when your kernel runs for too long:
Windows XP: a kernel can run for at most about 5 seconds before the timeout.
Windows Vista: a kernel can run for at most about 3 seconds before the timeout.
Linux: no limit.

If you want to know how many registers your program uses, there are several ways to find out:

  1. Use the CUDA Visual Profiler.
  2. Pass the “--keep” flag to the compiler. You get a *.cubin file when you build your program, and inside that file you can see the number of registers your program uses.
  3. Pass the “--ptxas-options=-v” flag to the compiler. The number of registers your program uses is printed when the *.cu file is compiled.

I ran my program on Linux (CentOS 5.3) and got the timed-out error when my kernel ran for more than 10 seconds or so.

Is this limitation imposed by the operating system or by the NVIDIA driver/CUDA? My application is scientific computing, and kernels that run longer than 10 seconds are very common. Is there a way to get around this limitation?
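If the limit turns out to come from a display watchdog that cannot be switched off, one commonly suggested workaround is to split the work into several shorter kernel launches instead of one long one. The sketch below only shows the host-side bookkeeping. LaunchRange and planLaunches are hypothetical names, and how many items fit safely under the time limit is something you would have to measure for your own kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One kernel launch covers items [offset, offset + count).
struct LaunchRange {
    std::size_t offset;
    std::size_t count;
};

// Split totalItems into consecutive ranges of at most itemsPerLaunch items,
// so that each launch stays short enough to finish before the watchdog fires.
std::vector<LaunchRange> planLaunches(std::size_t totalItems,
                                      std::size_t itemsPerLaunch) {
    assert(itemsPerLaunch > 0);
    std::vector<LaunchRange> plan;
    for (std::size_t offset = 0; offset < totalItems; offset += itemsPerLaunch) {
        LaunchRange range;
        range.offset = offset;
        range.count = std::min(itemsPerLaunch, totalItems - offset);
        plan.push_back(range);
    }
    return plan;
}
```

The host loop would then launch the kernel once per range, passing the offset and count as kernel arguments, and keep any accumulated state in device memory between launches.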

Thanks for the tips. Unfortunately, I tried methods 2 and 3. Whenever I use atomicExch in the code together with the -arch compute_11 option, --keep does not produce a cubin, and --ptxas-options=-v does not report the register count either.

Timed-out error
I really don’t know why you get a timeout error when running on Linux. In my experiments, my program ran for a very long time (more than 30 minutes), but no timeout error occurred.
By the way, do you use your NVIDIA GPU card for two purposes (display and computing)?

Registers
I have used all 3 methods above and they all work perfectly (on Windows XP). I have never checked the register count for programs built on Linux, so I hope someone else can kindly give you some suggestions.

I used the second method and a file was generated: myprogram.sm_10.cubin

Its contents are:

architecture {sm_10}
abiversion {1}
modname {cubin}
code {
name = _Z5multiPiS_S_jjji
lmem = 0
smem = 44
reg = 11
bar = 1
bincode {
0x10004205 0x0023c780 0xa0000005 0x04000780
0x60014c09 0x00204780 0x3002d5fd 0x6c20c7c8
0xa002c003 0x00000000 0x1002c003 0x00000280
0x307cd5fd 0x6c20c7c8 0x1002c003 0x00000280
0x1100ee00 0x1100f204 0x40020a10 0x4005000c
0x60030811 0x00010780 0x6004020d 0x0000c780
0x30100811 0xc4100780 0x3010060d 0xc4100780
0x60020805 0x00010780 0x60040001 0x0000c780
0x3002d409 0xc4300780 0x2101ec18 0x2100e800
0xa002b003 0x00000000 0x1100ea04 0x20028c1c
0x20000021 0x04008780 0x10008009 0x00000003
0xd00e0c09 0xa0c00780 0x1000f825 0x0403c780
0x10008208 0x1000800c 0xa0026003 0x00000000
0xd00e0411 0x80c00780 0xd00e0615 0x80c00780
0x400b1029 0x00000780 0x600a1229 0x00028780
0x30101429 0xc4100780 0x600a1011 0x00028780
0x2004860d 0x00000003 0x20000825 0x04024780
0x300807fd 0x640147c8 0x2000d009 0x04208780
0xd00e0c25 0xa0c00780 0x1001a003 0x00000280
0xf0000001 0xe0000002 0x20048c19 0x00000003
0x30070dfd 0x640147c8 0x20048205 0x00000003
0x10015003 0x00000280 0xf0000001 0xe0000002
0xf0000001 0xe0000002 0x861ffe03 0x00000000
0xf0000001 0xe0000001
}
}

Now, what does all of this mean, and how can I analyze it?
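As far as I understand this text format, the lines that matter for resource usage are the "key = value" lines at the top of the code block: reg is the number of registers per thread, smem the bytes of shared memory per block, lmem the bytes of local memory per thread, and bar the number of barriers used. The bincode section is just the machine code. A small sketch for pulling those numbers out of the file text could look like this. parseCubinResources is a made-up helper, and it assumes a single kernel per file (with several kernels, later values would overwrite earlier ones).

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Collect the numeric "key = value" lines (lmem, smem, reg, bar) from the
// text of a cubin file. Lines such as "name = ..." or the hex words inside
// bincode are skipped because they do not match "<key> = <integer>".
std::map<std::string, int> parseCubinResources(const std::string& text) {
    std::map<std::string, int> fields;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream tokens(line);
        std::string key, equals;
        int value;
        if (tokens >> key >> equals >> value && equals == "=") {
            fields[key] = value;
        }
    }
    return fields;
}
```

For the file posted above, looking up "reg" in the returned map would give 11 and "smem" would give 44.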

Do you have X running? Linux has a watchdog timer if X is running.

The recent deviceQuery samples in SDK 2.2.x include some code to query whether the watchdog timer is active.

Are you accessing arrays in an extremely uncoalesced way? I don’t quite see why the operations you posted should take so long. Double-check that you’re not accidentally running this in an <<<1,1>>> launch configuration :)

In my experience, the register count reported by the nvcc flag “--ptxas-options=-v” is very accurate.