call stack / stack frame

and so i cleaned up the stack frame - significantly

however, in its almighty wisdom, the compiler still refuses to let extra blocks be seated

i am confident that it is not due to shared memory, but still due to local memory - i hardly spill anymore, and when i do, it is no more than 16 bytes

exactly how does one interpret the ptxas information on register usage? is the noted value again per thread?

ptxas info : Function properties for _Z8fwd_krnlbbjjjjjjjjjjddddPbS_S_S_S_S_PcS0_PjS1_S1_S1_S1_S1_S1_S1_S1_PdS2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 180 registers, 507 bytes smem, 688 bytes cmem[0]

if so, the value seems excessive, to say the least
is there a way to obtain a more detailed breakdown of the register use build-up, other than to completely strip the kernel, and to systematically put its components back again, noting changes in reported register use with each step?

Yes, the register use reported by ptxas is per thread. To get total register use by the kernel launch, you have to multiply by the number of threads launched, possibly also taking into account register allocation granularity.
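To put that in numbers, a quick host-side check could look like the sketch below (the empty placeholder kernel stands in for your actual kernel symbol, and the 65536-registers-per-SM figure applies to sm_35):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the actual kernel symbol (hypothetical).
__global__ void fwd_krnl() { }

int main()
{
    // ptxas reports registers PER THREAD; multiply by threads per block to
    // get the per-block cost. An sm_35 SM has 65536 32-bit registers, so at
    // 180 registers/thread a 32-thread block costs roughly 180 * 32 = 5760
    // registers (before allocation-granularity rounding), i.e. the register
    // budget alone caps residency at about 11 such blocks per SM.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, fwd_krnl);
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, fwd_krnl, 32, 0);
    printf("regs/thread: %d, max resident blocks/SM at 32 threads/block: %d\n",
           attr.numRegs, blocksPerSM);
    return 0;
}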

thanks, txbob

i believe 180 registers per thread is exorbitant; the kernel itself is basically a shell, calling/controlling a number of functions, and i really cannot think that any function would use that many registers per turn; perhaps the kernel/functions are 'clinging' to registers

is there a way to track this? otherwise i would have to start with an empty kernel shell, and add its functions one by one, to note register use/accumulation

I agree it sounds quite high. I don't know how to tease out some kind of allocated breakdown of register usage; perhaps others will (other than by inspecting the SASS, which is pretty manual). However, you seem to be focused on occupancy. Register usage can have a major effect on occupancy; AFAIK spilling by itself does not (although it can affect performance, obviously). You might want to validate your compass heading by first just using launch bounds to limit register usage (acknowledging the probability of spills) and seeing whether you can drive up occupancy.
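For reference, launch bounds are just a qualifier on the kernel definition; a sketch (the bound values and kernel name here are purely illustrative, not a recommendation):

// Sketch: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// tells ptxas to cap per-thread register use so that at least
// minBlocksPerMultiprocessor blocks of maxThreadsPerBlock threads can be
// resident, spilling to local memory if it has to. A blunter alternative is
// the nvcc option -maxrregcount=N, which caps every kernel in the file.
__global__ void __launch_bounds__(32, 16) fwd_krnl_bounded()
{
    // kernel body
}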

If you can, then there may be some value in attacking register usage. I’m kind of surprised that you can get 8 threadblocks “seated” while still having a register usage of 180 per thread. You must not be launching very many threads per block (my arithmetic says less than 64).

As a final suggestion, I do think that performance analysis-driven optimization is the way to go, instead of blindly attacking register usage, or occupancy, or any other specific metric. There are good presentations on the basics of analysis driven optimization.
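For the manual SASS route mentioned above, the relevant tool invocations are (executable name illustrative):

# disassemble the SASS for every kernel embedded in the binary
cuobjdump -sass a.out

# the per-kernel register/smem report quoted above comes from verbose ptxas output
nvcc -Xptxas -v ...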

“AFAIK spilling by itself does not”

i was under the impression that, to put it in my own words, the compiler sees spilling into L1 as an orange light, and spilling into global memory as a red light…
in the eyes of the compiler, the scope for squeezing out more performance should shrink as spilling increases

yes, i only seat blocks of 32 threads

this might sound awkward, but, in addition to the parallel implementation, my problem solves faster the more instances are processed concurrently, because, in short, it impacts arrival rates
hence, at present i obtain 120 instances, and i feel semi-accomplished

to be honest, the profiler behaves much like memtest with this particular kernel - simply put, it gets lost…

thanks for the advice

Re: the kernel itself is basically a shell, calling/controlling a number of functions

If those device functions reside in the same compilation unit as the kernel itself, they will likely get inlined into the kernel which can create very large code and a lot of register pressure due to the many variables used across the totality of that code. Again, this is speculation as to the cause of the high register pressure since you have not provided any of the code in question. Note that registers are 32 bit wide, so any 64-bit data objects, e.g. of type double, long long int, or size_t will take up two registers each.

To explore the impact of inlining on register pressure, you could experiment with the __noinline__ function attribute, or try to compile each function separately. I second txbob's recommendation to use the profiler for analysis-driven optimization instead of trying to guess what the performance limiters are in this code.
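A minimal sketch of the attribute placement (the function name and body are made up for illustration; separate compilation would additionally require building with -rdc=true):

// Sketch: __noinline__ asks the compiler to keep this as a true function
// call rather than inlining it into the kernel, so its local variables are
// not merged into the caller's register footprint. Keep in mind that each
// 64-bit argument (double, pointer, size_t) still occupies two 32-bit
// registers.
__noinline__ __device__ double eval_point(const double* d_fwd, unsigned int idx)
{
    return d_fwd[idx] * 2.0;   // placeholder body
}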

i am doing something wrong, missing something, or a combination of the 2…

firstly, noinline does nothing whatsoever to the register usage

secondly, this is the kernel as shell, stripped of all the device functions:

__global__ void _fwd_krnl(bool in_have_ref_case, bool gbl_pnts_rem,
	unsigned int cur_krnl_n, unsigned int comb_arr_size, unsigned int iss_mul_comb_arr_size,
	unsigned int comb_pnt_n, unsigned int tot_block_bnd, unsigned int gbl_pnt_cnt,
	unsigned int block_pnt_cnt_trans_pnt, unsigned int block_pnt_cnt_trans_pnt_block_n,
	unsigned int block_pnt_cntA, unsigned int block_pnt_cntB,
	double in_ref_case, double port_val, double max_fwd, double comb_res,
	bool* d_in_pnt, bool* d_block_bnd_has_range, bool* d_block_posted_ref_case, bool* d_gbl_posted_ref_case,
	bool* d_gbl_set_ref, bool* d_stop_exec,
	char* d_gbl_sol_overflow_type, char* d_block_stat,
	unsigned int* d_pnts_cnt, unsigned int* d_rem_in_val, unsigned int* d_in_pnt_val,
	unsigned int* d_port_pnt_iss_mul_data, unsigned int* d_iss_mul_data, unsigned int* d_iss_port_pnt_in_arr,
	unsigned int* d_1st_ref_upd, unsigned int* d_gbl_block_sol_cnt, unsigned int* d_terminate,
	double* d_fwd, double* d_comb, double* d_start_pnt_range_data, double* d_port_pnt_time,
	double* d_port_pnt_time_diff, double* d_block_start_time_diff_rec, double* d_iss_cpn_in_arr,
	double* d_iss_val, double* d_iss_comb_sum, double* d_port_pnt_mul_sum, double* d_gbl_ref_case,
	double* d_block_ref_case, double* d_best_result_stack, double* d_sol_stack, double* d_worst_ref_case,
	double* d_result_upd, double* d_split_comb, double* d_split_fwd, double* d_split_work_data)
{
	__shared__ bool break_for_overhead;
	__shared__ bool must_do_overhead;
	__shared__ bool must_terminate;
	__shared__ unsigned int rerun_cnt;

	__shared__ unsigned int cur_pnt_port_pnt[3];
	__shared__ double cur_pnt_fwd[3];
	__shared__ double start_pnt_range_data[start_pnt_range_data_length];

	__shared__ Fwd_krnl_data fwd_krnl_data;

	if (threadIdx.x < start_pnt_range_data_length)
	{
		start_pnt_range_data[threadIdx.x] = d_start_pnt_range_data[threadIdx.x];
	}

	__syncthreads();
}

and this is the ptxas information for it, followed by the original kernel's information:

ptxas info : Compiling entry function '_Z9_fwd_krnlbbjjjjjjjjjjddddPbS_S_S_S_S_PcS0_PjS1_S1_S1_S1_S1_S1_S1_S1_PdS2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_' for 'sm_35'
ptxas info : Function properties for _Z9_fwd_krnlbbjjjjjjjjjjddddPbS_S_S_S_S_PcS0_PjS1_S1_S1_S1_S1_S1_S1_S1_PdS2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_
360 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 97 registers, 16 bytes smem, 688 bytes cmem[0]
ptxas info : Compiling entry function '_Z8fwd_krnlbbjjjjjjjjjjddddPbS_S_S_S_S_PcS0_PjS1_S1_S1_S1_S1_S1_S1_S1_PdS2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_' for 'sm_35'
ptxas info : Function properties for _Z8fwd_krnlbbjjjjjjjjjjddddPbS_S_S_S_S_PcS0_PjS1_S1_S1_S1_S1_S1_S1_S1_PdS2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 180 registers, 507 bytes smem, 688 bytes cmem[0]

the shell itself is already at 97 registers…?

97 registers would be more or less the number of registers required to store the kernel parameters…

but those are passed via constant memory, not so…? so why would they end up in registers already…?

by now i have rolled up not only the parameters of the device functions within the kernel, but also the kernel's own parameters

ptxas information reports register use down from 180 to 35 - 41:

ptxas info : Function properties for _Z8fwd_krnljPjPdPPbPPcPS_PS0_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 35 registers, 951 bytes smem, 376 bytes cmem[0]
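in outline, the roll-up looks roughly like the sketch below (names and layout are illustrative, not the real code):

// sketch: instead of ~50 individual kernel parameters, pass a few packed
// device arrays holding the scalars, plus arrays of pointers holding the
// original buffer pointers
__global__ void fwd_krnl(unsigned int   cur_krnl_n,
                         unsigned int*  d_uint_scalars,   // packed unsigned int scalars
                         double*        d_dbl_scalars,    // packed double scalars
                         bool**         d_bool_bufs,      // the former bool* parameters
                         char**         d_char_bufs,      // the former char* parameters
                         unsigned int** d_uint_bufs,      // the former unsigned int* parameters
                         double**       d_dbl_bufs)       // the former double* parameters
{
    // each value is fetched from the packed containers only where it is
    // actually needed, instead of everything being live at kernel entry
    unsigned int comb_arr_size = d_uint_scalars[0];   // illustrative indexing
    double       port_val      = d_dbl_scalars[0];
    // ... kernel work ...
}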

i now manage to seat 210 blocks concurrently across the device, up from 120

i shall later attempt to profile more thoroughly, to see what the optimal loading is

thanks txbob, njuffa, gregory for your input